LARGE DEVIATIONS FOR WEIGHTED EMPIRICAL
MEASURES ARISING IN IMPORTANCE SAMPLING 

HENRIK HULT AND PIERRE NYQUIST

ABSTRACT. Importance sampling is a popular method for efficient computation
of various properties of a distribution such as probabilities, expectations,
quantiles, etc. The output of an importance sampling algorithm can be
represented as a weighted empirical measure, where the weights are given by the
likelihood ratio between the original distribution and the sampling distribution.
In this paper the efficiency of an importance sampling algorithm is studied by
means of large deviations for the weighted empirical measure. The main result,
which is stated as a Laplace principle for the weighted empirical measure
arising in importance sampling, can be viewed as a weighted version of Sanov's
theorem. The main theorem is applied to quantify the performance of an
importance sampling algorithm over a collection of subsets of a given target set
as well as quantile estimates. The proof of the main theorem relies on the
weak convergence approach to large deviations developed by Dupuis and Ellis.



1. Introduction 

Computational complexity is a central issue in the design of modern technology 
and systems. Cheaper and smaller devices enable us to collect and transmit huge 
amounts of data. The data can be utilized to make systems faster, safer, and more 
versatile. In order to use the data effectively we need fast and reliable techniques 
for analyzing the data and to perform advanced computational tasks. This paper 
contributes to the development of a new approach, based on the theory of large 
deviations for empirical measures, to the analysis of efficient computational methods 
within the context of stochastic simulation. In the present paper the emphasis is 
on efficiency and design for algorithms based on importance sampling. 

Stochastic simulation is the collective term for simulating a physical system,
involving random effects, on a computer. Computational methods based on
stochastic simulation are fundamental in a wide range of applications, virtually all
areas where probability is applied, including chemistry, computer science, finance,
life sciences, networks, physics, power grids, reliability, solid mechanics, statistics,
etc. The basic idea in stochastic simulation is to generate a population of particles
that move randomly according to the laws of the physical system. Each particle
carries an individual weight, which may be updated during the simulation, and
quantities of the underlying physical system are computed by averaging the
particles' weights depending on their position. The standard example is Monte Carlo
simulation, where all the particles are independent and statistically identical and
their weights are constant and equal.

Although the standard Monte Carlo procedure is widely used it is by no means 
universally applicable. One reason is that particles may wander off to irrelevant 
parts of the state space, leaving only a small fraction of relevant particles that 
contribute to the computational task at hand. Therefore standard Monte Carlo 
may require a huge number of particles to obtain a desired precision, resulting in a 
computational cost that is too high for practical purposes. A control mechanism is 
needed that forces the particles to move to the relevant part of the space, thereby 
increasing the importance of each particle and reducing the computational cost. 
The control mechanism may come in different forms depending on the type of
algorithm under consideration. In importance sampling, see e.g. [1], the control is
the choice of sampling dynamics used to steer the particles towards the relevant
part of the state space. In splitting algorithms [5] and interacting particle systems
[2] the control mechanism comes, roughly speaking, in the form of a birth/death
mechanism, which ensures that important particles give birth to new particles and
irrelevant particles are killed.

The limited evidence provided by simply running numerical experiments has
generated the need for a deeper theoretical understanding and analysis of the
performance of stochastic simulation algorithms, see e.g. [1]. Much of the theoretical
analysis on the efficiency of stochastic simulation algorithms in general, and
importance sampling algorithms in particular, is based on analyzing the variance of
the resulting estimators. The variance is the canonical measure of variability of
unbiased estimators, but when an estimator is biased or skewed the variance can
be misleading. This paper aims to complement the variance analysis by a detailed
study of the rate function associated with a large deviation principle of the weighted
empirical measure associated to the output of the algorithm. The main result is
a Laplace principle for the weighted empirical measure resulting from a general
importance sampling algorithm. The rate function associated to the Laplace
principle can be used to identify which part of the design is most likely to lead to
computational errors, and it leads to a deeper understanding of how the design of an
algorithm influences its performance.

Next follows a brief outline of our approach. For the sake of illustration, consider 
the problem of computing the probability of an event using importance sampling. 
In importance sampling the main design choice is the sampling dynamics to be 
used for generating the trajectories of the particles. The output of an importance 
sampling algorithm is a collection of particles at different locations with individual 
weights, represented as a weighted empirical measure. If the sampling dynamics are 
well chosen many particles will be located in or near the event and the variability 
of their weights will be small. As the number of particles increases the weighted
empirical measure will look more and more like the true distribution on the event 
and the estimated probability of the event will converge to the true probability. 
But how fast? The theory of large deviations can be used to show that, under 
certain conditions, the error probability decays exponentially fast in the number of 
particles and the associated rate function tells us what the exponential rate is. The 
rate function depends on the design of the algorithm, in this case on the choice of 
sampling dynamics. The design choice then reduces to selecting sampling dynamics 
that maximize the exponential rate of decay of errors. Moreover, the rate function
will typically emphasize the features of a model that are most likely to contribute
to estimation errors and discard those features that are of less importance. Thus,
the rate function can potentially be used to identify key aspects of the sampling dy- 
namics that will reduce the probability of errors. The key idea is that the properties 
of the rate function can be utilized to suggest new and improved algorithms. 

To avoid confusion it should be pointed out that techniques from sample path
large deviations have been studied thoroughly in the context of designing efficient
rare-event simulation algorithms, where they are used to control the rate at which
the variance decays. Our objective is fundamentally different. We want to replace or
complement the variance by the rate function of a large deviations principle as the
number of particles increases. For our purposes the appropriate framework is large
deviations for empirical measures, in the spirit of Sanov's theorem [4, Theorem 2.2.1],
rather than sample path large deviations. The suggested approach is applicable to a
wide range of simulation problems and is not intended exclusively for problems in
rare-event simulation.

Our contributions can be summarized as follows. The main result in this paper
is a Laplace principle, in the space of finite measures equipped with the $\tau$-topology,
for the weighted empirical measure arising in importance sampling. The result
is expected in the sense that it can be guessed from Sanov's theorem and the
contraction principle. Our proof of the general version, stated in Theorem 3.1,
is based on the weak convergence approach to large deviations and follows, with
some adaptation, the proof of Sanov's theorem in [4]. Its relevance is mainly that
it leads to a method for theoretical quantification of performance for importance 
sampling algorithms. The main theorem is applied to quantifying the performance 
of an importance sampling algorithm over a collection of subsets of a given target 
set as well as to quantile estimates. Furthermore, the result is potentially useful 
for theoretical comparison of the performance of algorithms of different character, 
based on, say, importance sampling and interacting particle systems, by comparing 
the associated rate functions. 

The outline of the paper is as follows. In Section 2 an introduction to importance
sampling is presented along with background on variance based efficiency analysis
and large deviations for empirical measures. The main result, which is a Laplace
principle for the weighted empirical measure of an importance sampling algorithm,
is presented in Section 3. Applications of the main result to efficiency analysis
and design of importance sampling algorithms are given in Section 4. Most of
the examples are well studied elsewhere and are mainly intended to demonstrate
the efficiency analysis using the rate function in contrast to the standard variance
analysis. Section 5 contains the proof of the main theorem.



2. Background 

Let $\mathcal{X}$ be a complete separable metric space equipped with its Borel $\sigma$-field $\mathcal{B}(\mathcal{X})$
and let $(\Omega, \mathcal{F}, P)$ be a probability space. Unless otherwise stated, subsets of $\mathcal{X}$ under
consideration are always assumed to be in $\mathcal{B}(\mathcal{X})$. Consider a random variable $X : \Omega \to \mathcal{X}$
with distribution $F$. Denote by $\mathcal{M}_1$ the space of probability measures on $\mathcal{X}$. The
objective is to approximate $F$ in a given region $A \in \mathcal{B}(\mathcal{X})$ or to compute $\Phi(F)$,
where $\Phi : \mathcal{M}_1 \to \mathbb{R}$ is a given functional. Examples of functionals that may be of
interest are expectations,

\[ \Phi_f(F) = \int f \, dF, \quad \text{for some } f : \mathcal{X} \to \mathbb{R}, \]

and, when $X$ is real valued, quantiles,

\[ \Phi_q(F) = F^{-1}(q) = \inf\{x : F((x, \infty)) \le 1 - q\}, \quad q \in (0, 1), \]

and L-statistics,

\[ \Phi(F) = \int_0^1 \phi(q) F^{-1}(q) \, dq. \]



Only in exceptional cases is explicit computation of such functionals possible.
When explicit computations are not possible a viable alternative is simulation.
The standard method of simulation is Monte Carlo, in which the empirical measure

\[ F_n = \frac{1}{n} \sum_{k=1}^{n} \delta_{X_k} \]

is constructed from an independent sample $X_1, \dots, X_n$ from $F$. Here $\delta_x$ denotes a
unit point mass at $x$. The quantity $\Phi(F)$ is estimated by the plug-in estimator
$\Phi(F_n)$. Roughly speaking, if $F_n$ is a good approximation of $F$ in a region that
largely determines $\Phi(F)$, then $\Phi(F_n)$ is likely to provide a good estimate.
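As an illustration of the plug-in idea, the following Python sketch evaluates two of the functionals above on an empirical measure. It is a minimal sketch under stated assumptions: the function names `plug_in_mean` and `empirical_quantile` are hypothetical and introduced only for this example, and the sample is assumed to be a plain list of real numbers.

```python
import math


def plug_in_mean(sample, f=lambda x: x):
    """Plug-in estimate of an expectation: integrate f against the empirical measure."""
    return sum(f(x) for x in sample) / len(sample)


def empirical_quantile(sample, q):
    """Plug-in quantile inf{x : F_n((x, inf)) <= 1 - q}, for 0 < q < 1."""
    xs = sorted(sample)
    n = len(xs)
    # The empirical mass above the k-th smallest point is at most (n - k) / n, so
    # the infimum is the k-th order statistic with k = ceil(n * q). The small
    # offset guards against floating-point round-up when n * q is an integer.
    k = max(1, math.ceil(n * q - 1e-12))
    return xs[k - 1]
```

For a sample `xs`, `plug_in_mean(xs)` and `empirical_quantile(xs, 0.5)` play the role of $\Phi(F_n)$ for the expectation and the median, respectively.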

2.1. Large deviations analysis for quantifying the performance of Monte 
Carlo algorithms. Let us try to quantify how efficient the plug-in estimator is by 
means of large deviations for the associated empirical measure. 

Obviously, the sample size $n$ will affect the precision of the estimator $\Phi(F_n)$. By
the law of large numbers for empirical measures $F_n$ converges weakly to $F$ with
probability one, as $n \to \infty$, and an increased sample size $n$ will thus improve the
accuracy in this sense. In the case when $\Phi(F_n)$ is an unbiased estimate of $\Phi(F)$
the sample size required to reach a desired precision is well captured by $\mathrm{Var}(\Phi(F_n))$
and an analysis of the performance of an estimator can be done in terms of the
variance. However, in the general case $\Phi(F_n)$ can be biased, and looking solely at
the variance of the estimator may be insufficient.

An alternative way of quantifying the efficiency of the Monte Carlo estimator
is through the theory of large deviations. To illustrate the point, let us consider
the example of computing the expectation $F(f) = \int f \, dF$ for some $f : \mathcal{X} \to \mathbb{R}$ by
Monte Carlo simulation. An estimate of $F(f)$ is then given by

\[ F_n(f) = \frac{1}{n} \sum_{i=1}^{n} f(X_i), \]

where the $X_i$'s are independent with common distribution $F$.

Cramér's theorem states that if $E[\exp\{\theta f(X)\}] < \infty$ for $\theta$ in a neighborhood of
the origin, then $F_n(f)$ satisfies the large deviation principle:

\[
\begin{aligned}
\limsup_{n \to \infty} \frac{1}{n} \log P(F_n(f) \in \bar{A}) &\le -I(\bar{A}), \\
\liminf_{n \to \infty} \frac{1}{n} \log P(F_n(f) \in A^{\circ}) &\ge -I(A^{\circ}),
\end{aligned}
\tag{2.1}
\]

for all Borel sets $A$, where $\bar{A}$ and $A^{\circ}$ denote the closure and interior of $A$,
respectively, $I(A) = \inf_{x \in A} I(x)$, and the rate function $I$ is given by

\[ I(x) = \sup_{\theta} \{\theta x - \kappa(\theta)\}, \]

and $\kappa(\theta) = \log E[\exp\{\theta f(X)\}]$ is the logarithm of the moment generating function
[3, p. 26].

Suppose for the sake of illustration that, with probability at least $1 - \delta$, a relative
precision $\varepsilon$ is desired in the estimate. That is, the sample size $n$ must be selected
sufficiently large that

\[ P(|F_n(f) - F(f)| \ge \varepsilon F(f)) \le \delta. \]

Cramér's theorem, with $A_\varepsilon$ denoting the complement of the open ball of radius
$\varepsilon F(f)$ centered at $F(f)$, implies that

\[ \limsup_{n} \frac{1}{n} \log P(|F_n(f) - F(f)| \ge \varepsilon F(f)) = \limsup_{n} \frac{1}{n} \log P(F_n(f) \in A_\varepsilon) \le -I(A_\varepsilon). \tag{2.2} \]

Then, at least approximately,

\[ P(F_n(f) \in A_\varepsilon) \le e^{-n I(A_\varepsilon)}, \]

for large $n$, and the upper bound $\delta$ on the error probability corresponds to

\[ n \approx \frac{1}{I(A_\varepsilon)} (-\log \delta). \tag{2.3} \]

Roughly speaking the sample size must be proportional to the reciprocal of the rate
for the error probability in order to obtain the desired performance.
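The sample size rule (2.3) can be made concrete in a case where the rate function is explicit. The sketch below assumes, purely for illustration, that $f$ is the indicator of an event of probability $p$, so that the summands are Bernoulli($p$) and Cramér's rate function is the binary relative entropy. The function names are hypothetical.

```python
import math


def bernoulli_rate(x, p):
    """Cramér rate function I(x) for the mean of Bernoulli(p) variables, 0 < p < 1."""
    if x < 0.0 or x > 1.0:
        return math.inf

    def term(a, b):
        # Convention 0 * log 0 = 0.
        return 0.0 if a == 0.0 else a * math.log(a / b)

    return term(x, p) + term(1.0 - x, 1.0 - p)


def required_sample_size(p, eps, delta):
    """Sample size suggested by (2.3): n is about -log(delta) / I(A_eps)."""
    # I is convex with minimum at p, so the infimum over the complement of the
    # open ball of radius eps * p is attained at one of the two endpoints.
    rate = min(bernoulli_rate(p * (1.0 + eps), p), bernoulli_rate(p * (1.0 - eps), p))
    return math.ceil(-math.log(delta) / rate)
```

For instance, `required_sample_size(0.01, 0.1, 0.05)` is much larger than `required_sample_size(0.5, 0.1, 0.05)`, reflecting that the rate shrinks as the event becomes rare.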

Let us consider a more general example where we are interested in approximating
the true distribution $F$ over a region $A \in \mathcal{B}(\mathcal{X})$. Again the empirical measure
$F_n$ resulting from Monte Carlo simulation, restricted to $A$, provides a viable
approximation. In this context a large deviation principle for the empirical measure
can be applied to quantify the performance of the Monte Carlo algorithm. Sanov's
theorem [4, Theorem 2.2.1] states that $F_n$ satisfies a large deviation principle on
$\mathcal{M}_1$ with rate function given by the relative entropy $\mathcal{H}(\cdot \mid F)$, where

\[ \mathcal{H}(G \mid F) = \int \Big( \log \frac{dG}{dF} \Big) \, dG, \quad G \in \mathcal{M}_1. \]

More precisely,

\[
\begin{aligned}
\limsup_{n \to \infty} \frac{1}{n} \log P(F_n \in \bar{A}) &\le -I(\bar{A}), \\
\liminf_{n \to \infty} \frac{1}{n} \log P(F_n \in A^{\circ}) &\ge -I(A^{\circ}),
\end{aligned}
\]

where $A \in \mathcal{B}(\mathcal{M}_1)$ and $I(G) = \mathcal{H}(G \mid F)$.

Taking $A = U_G$ to be a small neighborhood of $G$ one can interpret, when $n$
is large, $\exp\{-n I(G)\}$ as an approximation for the probability that the empirical
measure looks like a typical sample from $G$. If it is undesirable that the empirical
measure looks like a typical sample from a specific distribution $G$, say the objective
is to have

\[ P(F_n \in U_G) \le \delta, \]

then, by the same reasoning as above, the sample size must be selected sufficiently
large that

\[ n \ge \frac{1}{I(G)} (-\log \delta). \tag{2.4} \]

Notice that taking $A = \{G \in \mathcal{M}_1 : |G(f) - F(f)| \ge \varepsilon F(f)\}$ in Sanov's theorem one
can recover (2.3).

In the context of a general functional $\Phi$ Sanov's theorem may also be applied.
Either one would try to protect against a certain undesirable shape $G$ of the empirical
measure by selecting $n$ as in (2.4), or one would look at error probabilities
corresponding to sets of the form

\[ A_\varepsilon = \{G \in \mathcal{M}_1 : |\Phi(G) - \Phi(F)| \ge \varepsilon \Phi(F)\}. \]

It may be pointed out that when considering a set such as $A_\varepsilon$ the rate is given by

\[ I(A_\varepsilon) = \inf_{G \in A_\varepsilon} I(G), \]

so if the infimum is attained at $G_\varepsilon$, then limiting the probability of $A_\varepsilon$ corresponds
precisely to protecting against $G_\varepsilon$.
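To make the role of the relative entropy in (2.4) tangible, the following sketch computes $\mathcal{H}(G \mid F)$ for distributions on a finite set and the corresponding sample size. The finite state space and the function names are assumptions made only for this illustration.

```python
import math


def relative_entropy(g, f):
    """H(G | F) = sum_x g(x) log(g(x) / f(x)) on a finite set; infinite unless G << F."""
    total = 0.0
    for gx, fx in zip(g, f):
        if gx > 0.0:
            if fx == 0.0:
                return math.inf
            total += gx * math.log(gx / fx)
    return total


def sample_size_to_avoid(g, f, delta):
    """Sample size suggested by (2.4) to protect against the undesirable shape G."""
    rate = relative_entropy(g, f)
    if rate == 0.0:
        raise ValueError("G coincides with F, so no sample size protects against it")
    return math.ceil(-math.log(delta) / rate)
```

A distribution $G$ close to $F$ in relative entropy requires a large sample before the empirical measure is unlikely to resemble it, whereas a distant $G$ is excluded quickly.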

2.2. Importance sampling. Importance sampling is a popular method to improve
the accuracy of Monte Carlo simulation. The basic idea is to draw the
samples from a sampling distribution that is more likely to generate samples from
the desired region. Suppose that the goal is to evaluate $\Phi(F)$ for some functional
$\Phi$. For simplicity, start by considering the case $\Phi(F) = F(f) = \int f(x) F(dx)$ for
an $F$-integrable, non-negative function $f : \mathcal{X} \to \mathbb{R}$. Let $\tilde{F}$ be the chosen sampling
distribution. For $\tilde{F}$ to be a feasible sampling distribution it must hold that $F \ll \tilde{F}$
on the support of $f$. Then the Radon-Nikodym derivative $dF/d\tilde{F}$ exists on $\{f > 0\}$
and it is possible to define the weight function

\[ w(x) = \frac{dF}{d\tilde{F}}(x) I\{f(x) > 0\}. \tag{2.5} \]

Let $\tilde{X}_1, \dots, \tilde{X}_n$ be an independent sample from $\tilde{F}$. The weighted empirical measure
corresponding to the importance sampling algorithm is

\[ \tilde{F}_n^{w} = \frac{1}{n} \sum_{k=1}^{n} w(\tilde{X}_k) \delta_{\tilde{X}_k}. \]

Note that in contrast to standard Monte Carlo $\tilde{F}_n^{w}$ is typically not a probability
measure. The importance sampling estimator of $F(f)$ is the plug-in estimator

\[ \tilde{F}_n^{w}(f) = \frac{1}{n} \sum_{i=1}^{n} w(\tilde{X}_i) f(\tilde{X}_i). \tag{2.6} \]

Let $P$ and $E$ denote the probability and expectation when $\tilde{X}_1, \tilde{X}_2, \dots$ are
sampled from the sampling distribution $\tilde{F}$. If $E[\exp\{\theta w(\tilde{X}) f(\tilde{X})\}] < \infty$ for $\theta$ in a
neighborhood of the origin, then Cramér's theorem implies that $\tilde{F}_n^{w}(f)$ satisfies a
large deviation principle with rate function

\[ I^{w}(x) = \sup_{\theta} \{\theta x - \kappa^{w}(\theta)\}, \qquad \kappa^{w}(\theta) = \log E[\exp\{\theta w(\tilde{X}) f(\tilde{X})\}]. \]

Suppose, as for the Monte Carlo illustration, that with probability at least $1 - \delta$
a relative precision $\varepsilon$ is desired in the estimate. That is, the sample size $n$ must be
selected sufficiently large that

\[ P(|\tilde{F}_n^{w}(f) - F(f)| \ge \varepsilon F(f)) \le \delta. \]

Just as before Cramér's theorem, with $A_\varepsilon$ denoting the complement of the open
ball of radius $\varepsilon F(f)$ centered at $F(f)$, implies that the sample size must satisfy

\[ n \ge \frac{1}{I^{w}(A_\varepsilon)} (-\log \delta). \]

The choice of the sampling distribution enters in the rate function through the
weights $w = (dF/d\tilde{F}) I\{f > 0\}$. The improvement over standard Monte Carlo can
be quantified by comparing the rate function $I$ corresponding to Monte Carlo and
the rate function $I^{w}$ corresponding to importance sampling. Furthermore, a good
choice of the sampling distribution in the importance sampling algorithm is one
that maximizes the rate $I^{w}(A_\varepsilon)$.
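The estimator (2.6) and the weight function (2.5) can be sketched in a simple setting. The example below assumes, for illustration only, that $F$ is the standard normal distribution, that $f$ is the indicator of $(a, \infty)$, and that the sampling distribution is a normal distribution with shifted mean. These choices and the function name are not taken from the analysis above.

```python
import math
import random


def importance_sampling_tail(a, n, shift, rng):
    """Estimate P(X > a) for X ~ N(0, 1) by sampling from N(shift, 1)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(shift, 1.0)
        if x > a:  # f(x) = I{x > a}
            # Weight w(x) = dF/dF~(x) = exp(-shift * x + shift^2 / 2),
            # the density ratio of N(0, 1) against N(shift, 1).
            total += math.exp(-shift * x + 0.5 * shift * shift)
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)
    estimate = importance_sampling_tail(3.0, 200_000, 3.0, rng)
    exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))
    print(estimate, exact)
```

Setting `shift=0.0` recovers standard Monte Carlo, since all weights then equal one. Comparing the spread of repeated runs for different shifts gives an empirical counterpart to comparing the rates $I$ and $I^{w}$.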

To extend the analysis to more general functionals it is desirable to have an
analogue of Sanov's theorem for the weighted empirical measures $\tilde{F}_n^{w}$. In contrast
to Monte Carlo one cannot expect that the weighted empirical measure $\tilde{F}_n^{w}$ is a
good approximation to $F$ everywhere. Rather the sampling distribution is selected
to obtain a good precision in the important part of the space. If the objective is to
compute $\Phi(F)$ for some functional $\Phi$, then it suffices that $\tilde{F}_n^{w}$ approximates $F$ in the
region that largely determines $\Phi(F)$. For this purpose a non-negative measurable
function $f : \mathcal{X} \to \mathbb{R}$ is introduced, called the importance function.

The rough statement that $\tilde{F}_n^{w}$ is close to $F$ in the important region is made
precise by saying that the measure $\tilde{F}_n^{wf}$ is close to $F^{f}$ in the space $\mathcal{M} = \mathcal{M}(\mathcal{X})$ of
finite measures, where

\[ \tilde{F}_n^{wf} = \frac{1}{n} \sum_{k=1}^{n} w(\tilde{X}_k) f(\tilde{X}_k) \delta_{\tilde{X}_k}, \]

and $F^{f}$ is the finite measure given by

\[ F^{f}(g) = \int g(x) f(x) F(dx), \]

for each bounded measurable $g : \mathcal{X} \to \mathbb{R}$. Establishing an analogue of Sanov's
theorem for the weighted empirical measures is the main objective of this paper.

3. A Laplace principle for weighted empirical measures 

In this section the main result of this paper is stated. It is an extension of Sanov's 
theorem to the weighted empirical measures arising in importance sampling, stated 
as a Laplace principle. 

A sequence of random variables $U_n$ taking values in a topological space $\mathcal{U}$ is
said to satisfy a Laplace principle on $\mathcal{U}$ with rate function $I$ if, for all bounded,
continuous functions $h : \mathcal{U} \to \mathbb{R}$, it satisfies

\[ \lim_{n \to \infty} \frac{1}{n} \log E[e^{-n h(U_n)}] = -\inf_{u \in \mathcal{U}} \{h(u) + I(u)\}. \tag{3.1} \]

When $\mathcal{U}$ is a Polish space, the Laplace principle is equivalent to the sequence
satisfying a large deviation principle with the same rate function [4, Theorems 1.2.1,
1.2.3]. In the case of a general topological space, the relationship between the large
deviation principle and the Laplace principle is given by Varadhan's lemma and
Bryc's inverse, see [3] and the references therein.
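The Laplace limit (3.1) can be checked numerically in a toy case. The sketch below assumes that $U_n$ is the mean of $n$ Bernoulli($p$) variables, for which the left-hand side is an exact finite sum and the rate function is the binary relative entropy. The function names, the grid search, and the test function $h$ in the usage note are illustrative assumptions.

```python
import math


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def laplace_lhs(n, p, h):
    """(1/n) log E[exp(-n h(S_n / n))] for S_n ~ Binomial(n, p), computed exactly."""
    terms = []
    for k in range(n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log(1.0 - p))
        terms.append(log_pmf - n * h(k / n))
    return log_sum_exp(terms) / n


def bernoulli_rate(u, p):
    """Rate function of the Bernoulli(p) sample mean on [0, 1]."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(u, p) + term(1.0 - u, 1.0 - p)


def laplace_rhs(p, h, grid=10000):
    """-inf over u in [0, 1] of h(u) + I(u), approximated on a uniform grid."""
    return -min(h(i / grid) + bernoulli_rate(i / grid, p) for i in range(grid + 1))
```

With, say, `h = lambda u: 3.0 * abs(u - 0.6)` and `p = 0.3`, the values `laplace_lhs(2000, p, h)` and `laplace_rhs(p, h)` are close, and the gap shrinks as `n` grows.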

Suppose that $F$ and $\tilde{F}$ are two given distributions on a complete, separable
metric space $\mathcal{X}$. The space $\mathcal{M}$ of finite measures on $\mathcal{X}$ is equipped with the
$\tau$-topology; $\nu_n \xrightarrow{\tau} \nu$ if $\nu_n(g) \to \nu(g)$ for all bounded measurable $g : \mathcal{X} \to \mathbb{R}$. To avoid
the subtle measurability issues of the $\tau$-topology, see [4, pp. 333-334], we will have
to work with $\mathcal{F}_{\mathcal{M}}$, the smallest $\sigma$-field on $\mathcal{M}$ with respect to which the function
mappings $\nu \mapsto \int g \, d\nu$ are measurable.

Let $f$ be an importance function. That is, $f$ is a non-negative $F$-integrable
function characterizing the importance of different regions of $\mathcal{X}$. It is assumed that
$F \ll \tilde{F}$ on the support of $f$ and the weight function $w$ is defined as in (2.5). Let
$\mathcal{M}_1$, the space of probability measures, be equipped with the $\tau$-topology as well
and introduce the set $\Gamma = \{G \in \mathcal{M}_1 : G(wf) < \infty\}$. Define the mapping $\Psi$ from
the subset $\Gamma \subset \mathcal{M}_1$ to $\mathcal{M}$ as the mapping for which $\Psi(G; \cdot)$ is the finite measure
given by

\[ \Psi(G; g) = \int g(x) f(x) w(x) G(dx), \tag{3.2} \]

for each bounded measurable $g : \mathcal{X} \to \mathbb{R}$. A key observation is that $\tilde{F}_n^{wf}(g) =
\tilde{F}_n(wfg)$, where

\[ \tilde{F}_n = \frac{1}{n} \sum_{k=1}^{n} \delta_{\tilde{X}_k} \]

is the empirical measure obtained by sampling from $\tilde{F}$, and therefore $\tilde{F}_n^{wf} =
\Psi(\tilde{F}_n; \cdot)$. Note also that $\tilde{F}_n$ belongs to $\Gamma$ with probability 1.
Let $\Delta \subset \mathcal{M}_1$ be the set

\[ \Delta = \{G \in \mathcal{M}_1 : \mathcal{H}(G \mid \tilde{F}) < \infty\}. \tag{3.3} \]

We are now ready to introduce the rate function. Let $I : \mathcal{M} \to [0, \infty]$ be the
function defined by

\[ I(\nu) = \inf\{\mathcal{H}(G \mid \tilde{F}) : G \in \Gamma \cap \Delta, \ \Psi(G) = \nu\}, \tag{3.4} \]

when such $G$ exist and $I(\nu) = \infty$ otherwise. Proposition 5.10 below states that $I$
has sequentially compact level sets. Our main result is the following.

Theorem 3.1 (Laplace principle for weighted empirical measures). Let $F$ and $\tilde{F}$
be given as above and let $f$ be an importance function. Suppose that

(i) there exists a function $U : \mathcal{X} \to [0, \infty]$ such that $\int e^{U(x)} \tilde{F}(dx) < \infty$ and $U$
has relatively compact level sets,

(ii) $\int e^{\alpha w(x) f(x)} d\tilde{F}(x) < \infty$ for all $\alpha > 0$.

Then the sequence $\{\tilde{F}_n^{wf}\}$ of weighted empirical measures satisfies the following
Laplace principle on $\mathcal{M}$ equipped with the $\tau$-topology. For all bounded continuous
functions $h : \mathcal{M} \to \mathbb{R}$, measurable on $(\mathcal{M}, \mathcal{F}_{\mathcal{M}})$,

\[ \lim_{n \to \infty} \frac{1}{n} \log E[e^{-n h(\tilde{F}_n^{wf})}] = -\inf_{\nu \in \mathcal{M}} \{h(\nu) + I(\nu)\}, \tag{3.5} \]

with the rate function $I$ in (3.4).
The proof is given in Section 5.

Remark 3.2. Condition (i) is a version of Condition 8.2.2 in [4], adapted to the
case of independent and identically distributed variables. In the case of real-valued
random variables it is a very mild assumption. Take, for example,

\[ U(x) = \max\{\alpha \log(|x|), 0\}, \]

for some $\alpha > 0$. Then $U$ has relatively compact level sets and the condition of a
finite expectation of $e^{U(\tilde{X})}$ is in this case no stronger than

\[ E[|\tilde{X}|^{\alpha}] < \infty, \]

for some $\alpha > 0$.

Remark 3.3. The right-hand side of (3.5) can be written

\[ \inf_{\nu \in \mathcal{M}} \{h(\nu) + I(\nu)\} = \inf_{G \in \Gamma \cap \Delta} \{h(\Psi(G)) + \mathcal{H}(G \mid \tilde{F})\}. \]

Although the expression on the left is the standard way to express the limit of the
Laplace principle, the expression on the right better suits the weak convergence
approach to large deviations that will be adopted throughout the proof.

Consider the special case when the function $wf$ is bounded. Then $\Psi : \Gamma \to \mathcal{M}$
is continuous when both spaces are equipped with the $\tau$-topology and Theorem 3.1
follows essentially from a standard application of the contraction principle, see e.g.
[4, Theorem 1.3.2]. The main difficulty is to show that the Laplace principle holds
also in the general case.

4. Applications in performance analysis 

In this section the Laplace principle of Theorem 3.1 is applied to characterize
the performance of an importance sampling algorithm. In the first part we outline 
a method for analyzing the performance over a collection of subsets of a target 
region A. If the sampling distribution is designed for a target set A, then one can 
expect to have good performance for subsets C of A that are not too small relative 
to A. A few rather elementary examples illustrate the performance analysis based 
on Theorem 3.1. We also discuss briefly the rare-event limit when the target set
has small probability. The section ends with a brief discussion on performance 
analysis for importance sampling algorithms designed for computing the quantile 
of a distribution. 

4.1. Performance over a collection of subsets. In this section we are interested
in the performance of importance sampling algorithms over a region $A \subset \mathcal{X}$,
reflected in the importance function $f(x) = I\{x \in A\}$. The ideal is that the weighted
empirical measure $\tilde{F}_n^{wf}$ is close to $F$ on all measurable subsets of $A$. For large $n$
we can imagine that the weighted empirical measure looks like a typical sample from
a measure $\nu$, which is absolutely continuous with respect to $F$. The performance
of the importance sampling algorithm is good if it is likely that $\tilde{F}_n^{wf}$ looks like a
typical sample from some $\nu$ belonging to a set of measures for which the likelihood
ratio $d\nu/dF$ is close to 1 on $A$. For given $\varepsilon > 0$ and $\delta > 0$ (where $\delta = \delta' F(A)$ for
some $\delta' > 0$ is a reasonable choice), consider the sets

\[ A_{\varepsilon,\delta} = \{\nu \in \mathcal{M} : |d\nu/dF(x) - 1| \ge \varepsilon \text{ for } x \in \text{some } C \subset A \text{ with } F(C) \ge \delta\}, \]

\[ A^{+}_{\varepsilon,\delta} = \{\nu \in \mathcal{M} : d\nu/dF(x) \ge 1 + \varepsilon \text{ for } x \in \text{some } C \subset A \text{ with } F(C) \ge \delta\}, \tag{4.1} \]

\[ A^{-}_{\varepsilon,\delta} = \{\nu \in \mathcal{M} : d\nu/dF(x) \le 1 - \varepsilon \text{ for } x \in \text{some } C \subset A \text{ with } F(C) \ge \delta\}. \tag{4.2} \]

The rate $I(A_{\varepsilon,\delta})$, with $I$ as in Theorem 3.1, can be used to evaluate the
performance of the importance sampling algorithm. The interpretation is that $e^{-n I(A_{\varepsilon,\delta})}$
is roughly the probability that $\tilde{F}_n^{wf}$ provides an approximation of $F$ with relative
error of at least $\varepsilon$ on some subset $C \subset A$ that is not too small in the sense that
$F(C) \ge \delta$. The sets $A^{+}_{\varepsilon,\delta}$ and $A^{-}_{\varepsilon,\delta}$ have similar interpretations for overestimation
and underestimation, respectively. The sets $A^{+}_{\varepsilon,\delta}$ and $A^{-}_{\varepsilon,\delta}$ are somewhat easier to
analyze than $A_{\varepsilon,\delta}$, so for the sake of illustration they will be studied throughout
the rest of this section. In the examples that follow we will, to keep it short, work
exclusively with $A^{+}_{\varepsilon,\delta}$.

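If $G$ is a probability measure such that $\nu = \Psi(G)$, then $d\nu/dF = dG/d\tilde{F}$ on $A$
and

\[ I(A^{+}_{\varepsilon,\delta}) = \inf\Big\{\mathcal{H}(G \mid \tilde{F}) : \frac{dG}{d\tilde{F}}(x) \ge 1 + \varepsilon \text{ for } x \in \text{some } C \subset A \text{ with } F(C) \ge \delta\Big\}. \]

Let us compute $I(A^{+}_{\varepsilon,\delta})$. To start off, consider a fixed set $C \subset A$.

Lemma 4.1. Given $C \subset A$ it holds that

\[
\begin{aligned}
&\inf\Big\{\mathcal{H}(G \mid \tilde{F}) : \frac{dG}{d\tilde{F}}(x) \ge 1 + \varepsilon \text{ for } x \in C\Big\} \\
&\qquad = (1+\varepsilon)\tilde{F}(C) \log(1+\varepsilon) + (1 - (1+\varepsilon)\tilde{F}(C)) \log\Big(\frac{1 - (1+\varepsilon)\tilde{F}(C)}{1 - \tilde{F}(C)}\Big),
\end{aligned}
\]

where the infimum is attained for the probability measure $G^{*}$ with

\[ \frac{dG^{*}}{d\tilde{F}}(x) = 1 + \varepsilon \ \text{ for } x \in C, \qquad \frac{dG^{*}}{d\tilde{F}}(x) = \frac{1 - (1+\varepsilon)\tilde{F}(C)}{1 - \tilde{F}(C)} \ \text{ for } x \in C^{c}. \]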

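Proof. For any probability measure $G$ with $dG/d\tilde{F} \ge 1 + \varepsilon$ on $C$, convexity of the
function $\varphi(s) = s \log s$ and Jensen's inequality imply that

\[
\begin{aligned}
\mathcal{H}(G \mid \tilde{F}) &= \tilde{F}(C) \int_{C} \varphi\Big(\frac{dG}{d\tilde{F}}\Big) \frac{\tilde{F}(dx)}{\tilde{F}(C)} + \tilde{F}(C^{c}) \int_{C^{c}} \varphi\Big(\frac{dG}{d\tilde{F}}\Big) \frac{\tilde{F}(dx)}{\tilde{F}(C^{c})} \\
&\ge \tilde{F}(C) \varphi\Big(\frac{G(C)}{\tilde{F}(C)}\Big) + \tilde{F}(C^{c}) \varphi\Big(\frac{G(C^{c})}{\tilde{F}(C^{c})}\Big) \\
&= G(C)(\log G(C) - \log \tilde{F}(C)) + (1 - G(C))[\log(1 - G(C)) - \log(1 - \tilde{F}(C))].
\end{aligned}
\]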



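The last expression is convex as a function of $G(C)$ and, over the feasible range
$G(C) \ge (1+\varepsilon)\tilde{F}(C)$, is minimized at $G(C) = (1+\varepsilon)\tilde{F}(C)$. We conclude that the
lower bound

\[
\begin{aligned}
&\inf\Big\{\mathcal{H}(G \mid \tilde{F}) : \frac{dG}{d\tilde{F}}(x) \ge 1 + \varepsilon \text{ for } x \in C\Big\} \\
&\qquad \ge (1+\varepsilon)\tilde{F}(C) \log(1+\varepsilon) + (1 - (1+\varepsilon)\tilde{F}(C)) \log\Big(\frac{1 - (1+\varepsilon)\tilde{F}(C)}{1 - \tilde{F}(C)}\Big)
\end{aligned}
\]

holds. It is straightforward to check that the lower bound is attained by $G^{*}$. This
completes the proof. □

Denote by $J_{+}$ the set function given by

\[ J_{+}(C) = \inf\Big\{\mathcal{H}(G \mid \tilde{F}) : \frac{dG}{d\tilde{F}}(x) \ge 1 + \varepsilon \text{ for } x \in C\Big\}. \]

The rate $I(A^{+}_{\varepsilon,\delta})$ can be computed by minimizing the function $J_{+}$ over the feasible
sets.

Lemma 4.2. $I(A^{+}_{\varepsilon,\delta}) = \inf\{J_{+}(C) : C \subset A, F(C) \ge \delta\}$.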
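Proof. For any $C^{*} \subset A$ such that $F(C^{*}) \ge \delta$ we have

\[
\begin{aligned}
I(A^{+}_{\varepsilon,\delta}) &= \inf\Big\{\mathcal{H}(G \mid \tilde{F}) : \frac{dG}{d\tilde{F}}(x) \ge 1 + \varepsilon \text{ for } x \in \text{some } C \subset A \text{ with } F(C) \ge \delta\Big\} \\
&\le \inf\Big\{\mathcal{H}(G \mid \tilde{F}) : \frac{dG}{d\tilde{F}}(x) \ge 1 + \varepsilon \text{ for } x \in C^{*}\Big\} \\
&= J_{+}(C^{*}).
\end{aligned}
\]

Taking infimum over feasible sets $C^{*}$ leads to

\[ I(A^{+}_{\varepsilon,\delta}) \le \inf\{J_{+}(C^{*}) : C^{*} \subset A, F(C^{*}) \ge \delta\}. \]

It remains to show the reverse inequality. For every $\eta > 0$ there exists a probability
measure $G^{*} \in \Delta \cap \Gamma$ and a corresponding set $C^{*} \subset A$, with $F(C^{*}) \ge \delta$, such that

\[ I(A^{+}_{\varepsilon,\delta}) + \eta \ge \mathcal{H}(G^{*} \mid \tilde{F}) \ge J_{+}(C^{*}) \ge \inf\{J_{+}(C) : C \subset A \text{ and } F(C) \ge \delta\}. \]

Since $\eta > 0$ is arbitrary the proof is complete. □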

The next result characterizes the minimizing set $C$ in Lemma 4.2 in terms of the
likelihood ratio.

Lemma 4.3. For any $t > 0$, let

\[ C_t = \Big\{x \in A : \frac{dF}{d\tilde{F}}(x) \ge t\Big\}, \]

and $\delta > 0$. If there exists $t_\delta$ such that $F(C_{t_\delta}) = \delta$, then the infimum in Lemma 4.2
is attained by $C_{t_\delta}$. That is,

\[ I(A^{+}_{\varepsilon,\delta}) = \inf\{J_{+}(C) : C \subset A, F(C) \ge \delta\} = J_{+}(C_{t_\delta}). \]

In general,

\[ \sup\{J_{+}(C_t) : t > 0, F(C_t) \le \delta\} \le I(A^{+}_{\varepsilon,\delta}) \le \inf\{J_{+}(C_t) : t > 0, F(C_t) \ge \delta\}. \]

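Proof. The expression for $\mathcal{H}(G^{*} \mid \tilde{F})$ in Lemma 4.1 is increasing in $\tilde{F}(C)$. Thus,
minimizing $J_{+}(C)$ corresponds to taking the infimum of $\tilde{F}(C)$. Consider the problem

\[ \inf\{\tilde{F}(C) : C \subset A, F(C) \ge \delta\}. \]

Let $\mathcal{F}_A$ be the collection of all measurable functions $a : A \to [0, 1]$. Since indicator
functions of subsets of $A$ are included in $\mathcal{F}_A$,

\[ \inf\{\tilde{F}(C) : C \subset A, F(C) \ge \delta\} \ge \inf\Big\{\int_A a(x) \tilde{F}(dx) : a \in \mathcal{F}_A, \int_A a(x) F(dx) \ge \delta\Big\}, \]

with equality if the infimum on the right-hand side is attained for an indicator
function.

Suppose that there exists $t_\delta$ with $F(C_{t_\delta}) = \delta$. We claim that $a^{*}(x) = I\{x \in C_{t_\delta}\}$
is the solution to the problem on the right in the last display. To see this, take an
arbitrary $a \in \mathcal{F}_A$ such that $\int_A a(x) F(dx) \ge \delta$. Then

\[
\begin{aligned}
\int_A (a(x) - a^{*}(x)) \tilde{F}(dx) &= \int_A (a(x) - a^{*}(x)) \frac{d\tilde{F}}{dF}(x) F(dx) \\
&= \int_{C_{t_\delta}} (a(x) - 1) \frac{d\tilde{F}}{dF}(x) F(dx) + \int_{A \setminus C_{t_\delta}} a(x) \frac{d\tilde{F}}{dF}(x) F(dx) \\
&\ge \int_{C_{t_\delta}} (a(x) - 1) \frac{1}{t_\delta} F(dx) + \int_{A \setminus C_{t_\delta}} a(x) \frac{1}{t_\delta} F(dx) \\
&= \frac{1}{t_\delta} \int_A a(x) F(dx) - \frac{1}{t_\delta} \int_{C_{t_\delta}} F(dx) \\
&\ge 0,
\end{aligned}
\]

by the requirements on $a$ and $t_\delta$.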

In the general case, it is obvious that

\[ \inf\{J_{+}(C) : C \subset A, F(C) \ge \delta\} \le \inf\{J_{+}(C_t) : t \text{ such that } F(C_t) \ge \delta\}. \]

Moreover, for any $t > 0$ such that $F(C_t) \le \delta$ the same arguments as above yield

\[ \int_A (a(x) - I\{x \in C_t\}) \tilde{F}(dx) \ge \frac{1}{t} \int_A a(x) F(dx) - \frac{1}{t} F(C_t) \ge 0, \]

for all $a \in \mathcal{F}_A$ with $\int_A a(x) F(dx) \ge \delta$. We conclude that

\[ \inf\Big\{\int_A a(x) \tilde{F}(dx) : a \in \mathcal{F}_A, \int_A a(x) F(dx) \ge \delta\Big\} \ge \tilde{F}(C_t), \]

and as a consequence that

\[ \inf\{J_{+}(C) : C \subset A, F(C) \ge \delta\} \ge J_{+}(C_t). \]

The proof is completed by taking the supremum over $t$ such that $F(C_t) \le \delta$. □
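Denote by $\gamma^{+}_\varepsilon$ the function

\[ \gamma^{+}_\varepsilon(s) = (1+\varepsilon) s \log(1+\varepsilon) + (1 - (1+\varepsilon)s) \log\Big(\frac{1 - (1+\varepsilon)s}{1 - s}\Big), \]

so that the expression in Lemma 4.1 coincides with $\gamma^{+}_\varepsilon(\tilde{F}(C))$. Then,

\[ I(A^{+}_{\varepsilon,\delta}) = \gamma^{+}_\varepsilon(\tilde{F}(C_{t_\delta})), \]

where $t_\delta$ is such that $F(C_{t_\delta}) = \delta$. Observe that $\gamma^{+}_\varepsilon$ is an increasing function.
Therefore, a good choice of sampling distribution is one that makes $\tilde{F}(C_{t_\delta})$ large.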

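Next, consider the set $A^{-}_{\varepsilon,\delta}$ in (4.2). The following results are obtained completely
analogously to the case $A^{+}_{\varepsilon,\delta}$. $J_{-}$ is the direct analogue of the set function $J_{+}$.

Lemma 4.4. Given $C \subset A$ it holds that

\[
\begin{aligned}
&\inf\Big\{\mathcal{H}(G \mid \tilde{F}) : \frac{dG}{d\tilde{F}}(x) \le 1 - \varepsilon \text{ for } x \in C\Big\} \\
&\qquad = (1-\varepsilon)\tilde{F}(C) \log(1-\varepsilon) + (1 - (1-\varepsilon)\tilde{F}(C)) \log\Big(\frac{1 - (1-\varepsilon)\tilde{F}(C)}{1 - \tilde{F}(C)}\Big),
\end{aligned}
\]

where the infimum is attained for the probability measure $G_{*}$ with

\[ \frac{dG_{*}}{d\tilde{F}}(x) = 1 - \varepsilon \ \text{ for } x \in C, \qquad \frac{dG_{*}}{d\tilde{F}}(x) = \frac{1 - (1-\varepsilon)\tilde{F}(C)}{1 - \tilde{F}(C)} \ \text{ for } x \in C^{c}. \]

Lemma 4.5.

\[ I(A^{-}_{\varepsilon,\delta}) = \inf\{J_{-}(C) : C \subset A, F(C) \ge \delta\}. \]

Let $\gamma^{-}_\varepsilon$ be the function

\[ \gamma^{-}_\varepsilon(s) = (1-\varepsilon) s \log(1-\varepsilon) + (1 - (1-\varepsilon)s) \log\Big(\frac{1 - (1-\varepsilon)s}{1 - s}\Big). \]

This is an increasing function in $s$ and thus the optimal $C$ is given precisely by the
$C_{t_\delta}$ in Lemma 4.3. Thus, $I(A^{-}_{\varepsilon,\delta}) = \gamma^{-}_\varepsilon(\tilde{F}(C_{t_\delta}))$, whenever there exists $t_\delta$ such that
$F(C_{t_\delta}) = \delta$.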

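Example 4.6. Consider a standard Monte Carlo algorithm (i.e. $\tilde{F} = F$) and let $A$
be a set with $F(A) = p$. Put

\[ A^{+}_{\varepsilon,\delta} = \Big\{G \in \mathcal{M}_1 : \frac{dG}{dF}(x) \ge 1 + \varepsilon \text{ for } x \in \text{some } C \subset A, \ F(C) \ge \delta\Big\}. \]

The rate function, given by Sanov's theorem, is the relative entropy: $I^{MC}(G) =
\mathcal{H}(G \mid F)$. The rate can be computed, just as in the general importance sampling
case, as

\[ I^{MC}(A^{+}_{\varepsilon,\delta}) = \inf\{J^{MC}_{+}(C) : C \subset A, F(C) \ge \delta\}, \]

with

\[ J^{MC}_{+}(C) = \inf\Big\{\mathcal{H}(G \mid F) : \frac{dG}{dF}(x) \ge 1 + \varepsilon, \ x \in C\Big\} = \gamma^{+}_\varepsilon(F(C)). \]

Suppose there exists a set $C \subset A$ such that $F(C) = \delta$. Then, since $\gamma^{+}_\varepsilon$ is increasing, we
conclude that

\[ I^{MC}(A^{+}_{\varepsilon,\delta}) = \gamma^{+}_\varepsilon(\delta), \]

and, by the reasoning leading to (2.3), that the number of samples needed in order
to obtain a specific error probability is proportional to the reciprocal of the rate.
With $I^{IS}$ denoting the rate function of Theorem 3.1, and under the assumption that
in Lemma 4.3 such a $t_\delta$ exists, an importance sampling algorithm with sampling
distribution $\tilde{F}$ has rate

\[ I^{IS}(A^{+}_{\varepsilon,\delta}) = \gamma^{+}_\varepsilon(\tilde{F}(C_{t_\delta})). \]

If the cost for generating one sample from $\tilde{F}$ is $c$ times the cost for generating one
sample from $F$, then the reduction in computational cost is roughly

\[ c \, \frac{n^{IS}}{n^{MC}} \approx c \, \frac{I^{MC}(A^{+}_{\varepsilon,\delta})}{I^{IS}(A^{+}_{\varepsilon,\delta})} = c \, \frac{\gamma^{+}_\varepsilon(\delta)}{\gamma^{+}_\varepsilon(\tilde{F}(C_{t_\delta}))}. \]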
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Example 4.7 (Light-tailed random walk). Let F denote the distribution of a 
normalized light-tailed random walk with m steps. That is, F is the distribution 
of S m /m = (Xi + ■ • • + X m )/m where X±, . . . , X m are independent and identically 
distributed with finite moment generating function. Take A — (a, oo) to be the set 
of interest, i.e. F(A) = P(S m > ma) and take the sampling distribution Fg as an 
exponential change of measure: 

dF e 

Here k is the log moment generating function of X\ . Consider the set Af s in (|4.1|) . 
From Lemma |4~31 the rate I(A~^ S ) is given by j^{Fg{C^ )) where 

C h = [x 6 A: -Y, x i < W e ) - (l/mjlogf*)/*}. 

i=l 

Note that, by the choice of ts, 

F e (C ie ) = E[e es -- mK ^I{a < S m /m < (n(9) - (l/m)logt 5 )/0}] 
> e m ^ K ^¥.[I{a < S m /m < (k(6) - (1/m) log t 6 )/9}} 

If the cost for each replication of the importance sampling algorithm is $c$ times the cost for each replication of the standard Monte Carlo algorithm, then we conclude that the reduction in computational cost is given by

$$c\,\frac{\gamma_\varepsilon^+(\delta)}{\gamma_\varepsilon^+(F_\theta(C_{t_\delta}))} \leq c\,\frac{\gamma_\varepsilon^+(\delta)}{\gamma_\varepsilon^+(e^{m(\theta a - \kappa(\theta))}\delta)}.$$

A good choice of $\theta$ is the maximizer of $\theta a - \kappa(\theta)$, which is given by $\theta_a$ such that $\kappa'(\theta_a) = a$. In addition to suggesting the well-known exponential change of measure with parameter $\theta_a$, our large deviations analysis also provides a useful upper bound on the reduction in computational cost.
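
To make the choice of $\theta_a$ concrete, the sketch below maximizes $\theta a - \kappa(\theta)$ numerically. It assumes standard normal steps, so that $\kappa(\theta) = \theta^2/2$ and $\theta_a = a$ in closed form; this distributional choice, the function names and the values of $a$ and $m$ are illustrative assumptions and do not come from the paper.

```python
import math

def kappa(theta: float) -> float:
    """Log moment generating function; ASSUMED N(0,1) steps, kappa(theta) = theta^2 / 2."""
    return 0.5 * theta * theta

def exponent(theta: float, a: float) -> float:
    """The quantity theta * a - kappa(theta) that theta_a maximizes."""
    return theta * a - kappa(theta)

def maximizing_theta(a: float, lo: float = 0.0, hi: float = 20.0, iters: int = 200) -> float:
    """Ternary search for the maximizer of the concave map theta -> theta*a - kappa(theta)."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if exponent(m1, a) < exponent(m2, a):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

a, m = 1.5, 20                                        # illustrative values
theta_a = maximizing_theta(a)
growth_factor = math.exp(m * exponent(theta_a, a))    # e^{m(theta_a a - kappa(theta_a))}
```

For a step distribution with a different $\kappa$, only `kappa` needs to change; the search relies solely on concavity of $\theta \mapsto \theta a - \kappa(\theta)$.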

4.2. Applications to rare-event simulation. The efficiency analysis of the importance sampling algorithms presented so far is not targeted specifically to capture the performance of rare-event simulation algorithms. In this section we illustrate how a rare-event analysis can be performed, based on Theorem 3.1. The elementary examples presented in this section demonstrate that, based on Theorem 3.1, one can obtain similar results on rare-event efficiency as in the standard case where the efficiency analysis is based on the variance.

We begin by analyzing standard Monte Carlo and the importance sampling algorithm for the light-tailed random walk with an emphasis on rare events.

Example 4.8 (Rare-event analysis for standard Monte Carlo). Consider computing a probability $F(A) = p$ by standard Monte Carlo, as in Example 4.6. Take $\delta = \delta' p$ for some $\delta' \in (0, 1)$. The performance of the algorithm can be captured by the rate $I^{MC}(A^+_{\varepsilon,\delta' p}) = \gamma_\varepsilon^+(\delta' p)$, where

$$\begin{aligned} \gamma_\varepsilon^+(\delta' p) &= p\delta'(1+\varepsilon)\log(1+\varepsilon) + (1 - p\delta'(1+\varepsilon))\log\Big(\frac{1 - p\delta'(1+\varepsilon)}{1 - p\delta'}\Big) \\ &= p\delta'\big[(1+\varepsilon)\log(1+\varepsilon) - \varepsilon\big] + O(p^2). \end{aligned}$$

Thus, as $p \to 0$, the rate decays linearly with $p$ and as a consequence the sample size needed for a given precision increases proportionally to $1/p$.
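
The linear decay can be checked numerically. The sketch below compares $\gamma_\varepsilon^+(\delta' p)/p$ with the leading coefficient $\delta'[(1+\varepsilon)\log(1+\varepsilon) - \varepsilon]$ for decreasing $p$; the parameter values and function names are illustrative assumptions.

```python
import math

def gamma_plus(s: float, eps: float) -> float:
    """gamma_eps^+(s); requires 0 <= s and (1+eps)*s < 1."""
    a = (1.0 + eps) * s
    return a * math.log1p(eps) + (1.0 - a) * (math.log1p(-a) - math.log1p(-s))

def leading_coefficient(eps: float, delta_prime: float) -> float:
    """Coefficient of p in the expansion of gamma_eps^+(delta' p)."""
    return delta_prime * ((1.0 + eps) * math.log1p(eps) - eps)

eps, delta_prime = 0.1, 0.5               # illustrative values
coef = leading_coefficient(eps, delta_prime)
relative_errors = {
    p: abs(gamma_plus(delta_prime * p, eps) / (p * coef) - 1.0)
    for p in (1e-2, 1e-4, 1e-6)
}
```

The relative error shrinks roughly in proportion to $p$, consistent with a remainder of order $p^2$.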

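Example 4.9 (Rare-event analysis for the light-tailed random walk). In the random walk example, Example 4.7, the performance of the importance sampling algorithm, with $\theta = \theta_a$, is captured by $\gamma_\varepsilon^+(e^{m(\theta_a a - \kappa(\theta_a))}\delta)$. Taking $\delta = \delta' p_m$ where $p_m = P(S_m > ma)$ it follows that

$$\gamma_\varepsilon^+(e^{m(\theta_a a - \kappa(\theta_a))}\delta' p_m) = e^{m(\theta_a a - \kappa(\theta_a))} p_m \delta'\big[(1+\varepsilon)\log(1+\varepsilon) - \varepsilon\big] + O\big(e^{2m(\theta_a a - \kappa(\theta_a))} p_m^2\big).$$

The reduction in computational cost for the importance sampler vs. the standard Monte Carlo algorithm is then bounded from above by

$$c\,\frac{\gamma_\varepsilon^+(\delta' p_m)}{\gamma_\varepsilon^+(e^{m(\theta_a a - \kappa(\theta_a))}\delta' p_m)} \sim c\,e^{-m(\theta_a a - \kappa(\theta_a))}, \quad \text{as } m \to \infty.$$

The conclusion is that the reduction in computational cost is exponential in $m$.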
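
The exponential reduction can be illustrated in a case where everything is explicit. The sketch below assumes standard normal steps, so that $\theta_a = a$, $\theta_a a - \kappa(\theta_a) = a^2/2$ and $p_m = P(Z > a\sqrt{m})$ for a standard normal $Z$. This distributional choice and all parameter values are assumptions made for illustration only.

```python
import math

def gamma_plus(s: float, eps: float) -> float:
    """gamma_eps^+(s); requires 0 <= s and (1+eps)*s < 1."""
    a = (1.0 + eps) * s
    return a * math.log1p(eps) + (1.0 - a) * (math.log1p(-a) - math.log1p(-s))

def gaussian_tail(m: int, a: float) -> float:
    """p_m = P(S_m > m a) for ASSUMED N(0,1) steps, i.e. P(Z > a sqrt(m))."""
    return 0.5 * math.erfc(a * math.sqrt(m) / math.sqrt(2.0))

def cost_ratio_bound(m: int, a: float, eps: float, delta_prime: float, c: float = 1.0) -> float:
    """Upper bound c * gamma(delta' p_m) / gamma(e^{m a^2 / 2} delta' p_m) from the example."""
    p_m = gaussian_tail(m, a)
    tilt = math.exp(m * a * a / 2.0)      # e^{m(theta_a a - kappa(theta_a))} with theta_a = a
    return c * gamma_plus(delta_prime * p_m, eps) / gamma_plus(tilt * delta_prime * p_m, eps)

a, eps, delta_prime = 1.0, 0.1, 0.5       # illustrative values
# Bound divided by the predicted asymptotic order e^{-m a^2 / 2}
normalized = [cost_ratio_bound(m, a, eps, delta_prime) * math.exp(m * a * a / 2.0)
              for m in (10, 30, 50)]
```

In this setting the normalized values stay below 1 and move towards 1 as $m$ grows, which is the behaviour the asymptotic relation describes.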

We end this section by demonstrating the performance of the so-called zero-variance change of measure [U p. 127]. This choice of sampling distribution is optimal for estimating a probability in the sense that the variance of the estimator is zero, and it is often used as a reference point for designing efficient sampling distributions.

Example 4.10 (Zero-variance change of measure). Consider the probability $p = F(A)$, for some $A \subset \mathcal{X}$ and distribution $F$, and take the importance function to be $I\{x \in A\}$. The zero-variance sampling distribution is the distribution $\tilde F$ given by

$$\frac{d\tilde F}{dF}(x) = \frac{I\{x \in A\}}{p}.$$

Let $\delta' \in (0, 1)$ and $\delta = \delta' p$. The likelihood ratio is constant over the entire set $A$ and $\tilde F(C) = F(C)p^{-1}$ for any $C \subset A$. That is, $\tilde F(C) = F(C \mid A)$, the conditional probability under $F$ of $C$ given $A$. It follows that

$$J_+(C) = \inf\Big\{H(G \mid \tilde F) : \frac{dG}{d\tilde F}(x) \geq 1+\varepsilon,\ x \in C\Big\} = \gamma_\varepsilon^+(\tilde F(C)) = \gamma_\varepsilon^+(F(C)/p),$$

and the rate

$$I(A^+_{\varepsilon,\delta' p}) = \gamma_\varepsilon^+(\delta'),$$

which is independent of $p$.

For $\delta' = 1$, the case of having a relative error $\varepsilon$ on the estimate of $p$, the rate corresponding to the zero-variance change of measure is $+\infty$. This follows from the fact that no probability measure $G$ which is absolutely continuous with respect to $\tilde F$ on $A$, and thus must satisfy $G(A) = 1$, can give rise to such an error. In this particular case the rate ($+\infty$) can be obtained without using large deviations results.

4.3. Performance analysis for computing quantiles. Let the underlying space $\mathcal{X}$ be the real line and consider computing a quantile of a distribution $F$ on $\mathbb{R}$. For $\alpha \in (0, 1)$ the $\alpha$-quantile of a finite measure $\nu$ on $\mathbb{R}$ is defined by

$$\Phi_\alpha(\nu) = \inf\{x : \nu(x, \infty) \leq \alpha\}.$$
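
For a finite discrete measure, such as a weighted empirical measure with atoms at the samples and masses $w(X_i)/n$, this quantile can be computed by accumulating mass from the right. The sketch below assumes the non-strict inequality $\nu(x, \infty) \leq \alpha$ in the definition; the function name and the test data are illustrative.

```python
import math

def upper_quantile(atoms, masses, alpha):
    """Phi_alpha(nu) = inf{x : nu(x, inf) <= alpha} for nu = sum_i masses[i] * delta_{atoms[i]}.

    ASSUMES the non-strict inequality '<= alpha' in the definition.
    """
    tail = 0.0            # mass of atoms strictly above the current atom
    result = math.inf     # infimum over an empty set
    for x, m in sorted(zip(atoms, masses), reverse=True):
        if tail > alpha:
            break         # nu(x, inf) > alpha here and for every smaller x
        result = x        # nu(x, inf) = tail <= alpha, so x is admissible
        tail += m
    else:
        # Loop finished without break: every atom is admissible.
        if tail <= alpha:
            return -math.inf   # total mass <= alpha, so every real x is admissible
    return result

atoms = [1.0, 2.0, 3.0, 4.0]
masses = [0.25, 0.25, 0.25, 0.25]
q_03 = upper_quantile(atoms, masses, 0.3)
q_06 = upper_quantile(atoms, masses, 0.6)
```

Since the masses of a weighted empirical measure need not sum to one, the function does not normalize them.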

For $\varepsilon > 0$, consider the set

$$A_\varepsilon^+ = \{\nu \in \mathcal{M} : \Phi_\alpha(\nu) \geq (1+\varepsilon)\Phi_\alpha(F)\}.$$

Let $\tilde F$ be the sampling distribution of an importance sampling algorithm for computing $\Phi_\alpha(F)$. Suppose that the importance function is an indicator $I\{x \in (a, \infty)\}$ where $a < \Phi_\alpha(F)$ and let $I$ be the rate function in Theorem 3.1. The performance of the importance sampling algorithm can be quantified by $I(A_\varepsilon^+)$. Of course one may also consider $A_\varepsilon^-$ defined in the obvious way, but for this illustration we work exclusively with $A_\varepsilon^+$.

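Let us compute the rate $I(A_\varepsilon^+)$. First note that, with $q_{\alpha,\varepsilon} = (1+\varepsilon)\Phi_\alpha(F)$,

$$A_\varepsilon^+ = \{\nu \in \mathcal{M} : \Phi_\alpha(\nu) \geq q_{\alpha,\varepsilon}\} = \{\nu \in \mathcal{M} : \nu(q_{\alpha,\varepsilon}, \infty) \geq \alpha\}$$

and the rate is therefore given by

$$\begin{aligned} I(A_\varepsilon^+) &= \inf\{H(G \mid \tilde F) : \nu = \Psi(G),\ \nu(q_{\alpha,\varepsilon}, \infty) \geq \alpha\} \\ &= \inf\Big\{H(G \mid \tilde F) : \int I\{x > q_{\alpha,\varepsilon}\}w(x)G(dx) \geq \alpha\Big\}. \end{aligned}$$

The infimum is attained at $G^*$ given by

$$\frac{dG^*}{d\tilde F}(x) = \frac{e^{\lambda k(x)}}{M(\lambda)},$$

where $k(x) = I\{x > q_{\alpha,\varepsilon}\}w(x)$, $M(\lambda) = \int e^{\lambda k(x)}\tilde F(dx)$ and $\lambda$ is given as the solution to

$$\frac{\frac{d}{d\lambda}M(\lambda)}{M(\lambda)} = \alpha. \tag{4.3}$$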

To see that the infimum is indeed attained at $G^*$, note that by the variational formula for relative entropy [4, Proposition 4.5.1], for all $\lambda > 0$ and $G \in \mathcal{M}_1$ such that $\int k(x)G(dx) \geq \alpha$,

$$H(G \mid \tilde F) \geq \lambda\alpha - \log M(\lambda),$$

and the inequality is satisfied with equality for $G^*$. We have just proved the following.

Proposition 4.11. $I(A_\varepsilon^+) = \lambda\alpha - \log M(\lambda)$ where $\lambda$ is determined by (4.3).

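Example 4.12. For a standard Monte Carlo algorithm the rate $I(A_\varepsilon^+)$ can be explicitly computed. Indeed, in this case $k(x) = I\{x > q_{\alpha,\varepsilon}\}$ and

$$M(\lambda) = e^{\lambda}p_{\alpha,\varepsilon} + 1 - p_{\alpha,\varepsilon},$$

with $p_{\alpha,\varepsilon} = F(q_{\alpha,\varepsilon}, \infty) < \alpha$. The equation for $\lambda$ becomes

$$\frac{\frac{d}{d\lambda}M(\lambda)}{M(\lambda)} = \frac{e^{\lambda}p_{\alpha,\varepsilon}}{e^{\lambda}p_{\alpha,\varepsilon} + 1 - p_{\alpha,\varepsilon}} = \alpha,$$

which leads to

$$\lambda = \log\Big(\frac{\alpha(1 - p_{\alpha,\varepsilon})}{(1-\alpha)p_{\alpha,\varepsilon}}\Big)$$

and finally

$$I(A_\varepsilon^+) = \lambda\alpha - \log M(\lambda) = \alpha\log\Big(\frac{\alpha}{p_{\alpha,\varepsilon}}\Big) + (1-\alpha)\log\Big(\frac{1-\alpha}{1-p_{\alpha,\varepsilon}}\Big) = H(\alpha \mid p_{\alpha,\varepsilon}),$$

where $H(\alpha \mid p_{\alpha,\varepsilon})$ refers to the relative entropy between two Bernoulli distributions with parameters $\alpha$ and $p_{\alpha,\varepsilon}$, respectively.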
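
When no closed form is available, the equation (4.3) for $\lambda$ can be solved numerically. The following sketch does so by bisection for a sampling distribution with finitely many atoms and checks the result against the Monte Carlo closed form above. The discrete setting, the function names and the numerical values are assumptions made for illustration.

```python
import math

def rate_from_tilt(k_values, probs, alpha, lam_hi=50.0, iters=200):
    """Solve M'(lam)/M(lam) = alpha by bisection on [0, lam_hi].

    k_values[i] is k(x_i) and probs[i] the sampling probability of atom x_i
    (a finite sampling distribution is an ASSUMPTION of this sketch).
    Returns (lam, lam * alpha - log M(lam)).
    """
    def M(lam):
        return sum(p * math.exp(lam * k) for k, p in zip(k_values, probs))

    def tilted_mean(lam):
        # M'(lam) / M(lam); nondecreasing in lam
        return sum(p * k * math.exp(lam * k) for k, p in zip(k_values, probs)) / M(lam)

    lo, hi = 0.0, lam_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid) < alpha:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, lam * alpha - math.log(M(lam))

# Standard Monte Carlo case: k takes values 0 and 1 with probabilities 1 - p and p.
alpha, p = 0.1, 0.05                      # illustrative values with p < alpha
lam, rate = rate_from_tilt([0.0, 1.0], [1.0 - p, p], alpha)
lam_closed = math.log(alpha * (1.0 - p) / ((1.0 - alpha) * p))
rate_closed = alpha * math.log(alpha / p) + (1.0 - alpha) * math.log((1.0 - alpha) / (1.0 - p))
```

For a genuine importance sampling algorithm one would pass `k_values` of the form $I\{x_i > q_{\alpha,\varepsilon}\}w(x_i)$; the bisection requires that the tilted mean at $\lambda = 0$ lies below $\alpha$ and that `lam_hi` is large enough to bracket the solution.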

For a general importance sampling algorithm the expression for $I(A_\varepsilon^+)$ in Proposition 4.11 has to be worked out on a case-by-case basis.

5. Proof of Theorem 3.1

In this section the proof of the Laplace principle for the weighted empirical measures of importance sampling is presented. The proof relies on the weak convergence approach developed by Dupuis and Ellis [4]. The three main steps of the proof are:

(1) Derive a representation formula for

$$W^n = -\frac{1}{n}\log E\big[e^{-nh(\tilde F_n^{wf})}\big].$$

This is achieved by formulating a stochastic control problem that has a minimal cost function which is equal to $W^n$. In the setting considered here, the representation formula reads

$$W^n = \inf_{\{G_{n,j}\}} E\Big[\frac{1}{n}\sum_{j=0}^{n-1} H\big(G_{n,j}(\cdot \mid \bar F_{n,j}) \mid \tilde F\big) + \bar h(\bar F_n)\Big], \tag{5.1}$$

where $\bar F_{n,j}$ is the controlled process (empirical measure), $\bar F_{n,j+1} = \bar F_{n,j} + \frac{1}{n}\delta_{\bar X_{n,j}}$ and $\bar F_n = \frac{1}{n}\sum_{j=0}^{n-1}\delta_{\bar X_{n,j}}$, obtained by sampling $\bar X_{n,j}$ from the distribution (control) $G_{n,j}(\cdot \mid \bar F_{n,j})$.

(2) The representation formula (5.1) is used to prove the Laplace principle lower bound,

$$\liminf_{n}\frac{1}{n}\log E\big[e^{-nh(\tilde F_n^{wf})}\big] \geq -\inf_{G \in \Delta\cap\Gamma}\{h(\Psi(G)) + H(G \mid \tilde F)\}.$$

(3) The third and most involved step is to use the representation formula to prove the Laplace principle upper bound,

$$\limsup_{n}\frac{1}{n}\log E\big[e^{-nh(\tilde F_n^{wf})}\big] \leq -\inf_{G \in \Delta\cap\Gamma}\{h(\Psi(G)) + H(G \mid \tilde F)\}.$$

The steps (1)-(3) are precisely those taken in [4] for proving Sanov's theorem. The main difficulty is to prove the upper bound, in step (3). Our proof of Theorem 3.1 is, for the most part, a transcription of the proof in [?]. The main new difficulties are that the mapping $\Psi$ is defined on a subset of $\mathcal{M}_1$ and may be unbounded. The first point is handled by making minor adaptations to the arguments in [4] and the unboundedness of $\Psi$ is mainly treated in Lemma 5.9 below. In order to make the paper self-contained, results and constructions that are very similar to [4] are included and the corresponding references provided. In many cases the notation is consistent with that of [4], to make comparisons easier.

5.1. Representation formula. In this section (5.1) is shown to be equal to

$$W^n = -\frac{1}{n}\log E\big[e^{-nh(\tilde F_n^{wf})}\big].$$

The quantity $W^n$ is the one that appears in the Laplace principle (with the sign changed) and (5.1) is derived by considering a related stochastic control problem described below. The corresponding minimal cost function can be used as a representation of $W^n$. For a more thorough discussion of this control problem see [4, Section 2.3].

The difference from Sanov's theorem is that only a subset $\Gamma = \{G \in \mathcal{M}_1 : G(wf) < \infty\}$ of the space of probability measures is under consideration. The proof is therefore very similar to the standard case. Recall that $\tilde F_n^{wf}(g) = \tilde F_n(wfg)$ for each measurable $g$. In particular,

$$E[\tilde F_n^{wf}(I_{\mathcal{X}})] = E[\tilde F_n(wf)] = F(f) < \infty,$$

and it follows that $\tilde F_n \in \Gamma$ with probability 1. Let $\Gamma_n = \Gamma$ and define, for $j = 0, \dots, n-1$, recursively the sets $\Gamma_j \subset \mathcal{M}_{j/n}$ by

$$\Gamma_j = \Big\{G \in \mathcal{M}_{j/n} : \tilde F\big(\{y : G + \tfrac{1}{n}\delta_y \in \Gamma_{j+1}\}\big) = 1\Big\}.$$

That is, if $G \in \Gamma_j$, then sampling $Y$ from $\tilde F$ implies that $G + n^{-1}\delta_Y \in \Gamma_{j+1}$ with probability 1.

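Let $\mathcal{M}_0$ be the one point set containing only the null measure and introduce the measurable mapping $W^n : \bigcup_{k=0}^{n}\{k\} \times \mathcal{M}_{k/n} \to \bar{\mathbb{R}} = [-\infty, \infty]$ by

$$W^n(j, G) = \begin{cases} -\frac{1}{n}\log E\big[e^{-nh(\Psi(\tilde F_n))} \mid \tilde F_{n,j} = G\big], & \text{for } G \in \Gamma_j, \\ \infty, & \text{for } G \in \Gamma_j^c, \end{cases} \tag{5.2}$$

$$W^n(n, G) = \bar h(G), \tag{5.3}$$

where

$$\bar h(G) = \begin{cases} h(\Psi(G)), & \text{for } G \in \Gamma, \\ \infty, & \text{for } G \in \Gamma^c. \end{cases} \tag{5.4}$$

In particular, we set $W^n = W^n(0, 0)$. Note that in (5.2) $G \in \mathcal{M}_{j/n}$ is a subprobability measure, and the $\tilde F_{n,j} = (1/n)\sum_{i=0}^{j-1}\delta_{X_i}$ are the empirical subprobability measures obtained by sampling the $X_i$'s from $\tilde F$. Since the $\tilde F_{n,j}$ form a Markov chain, one can obtain the recursion formula

$$W^n(j, G) = -\frac{1}{n}\log\int e^{-nW^n(j+1,\, G + \frac{1}{n}\delta_x)}\tilde F(dx).$$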

Since $h$ is bounded and continuous the mapping $W^n$ is measurable, bounded from below, and bounded from above on $\Gamma$. Together with the recursion formula above, Proposition 4.5.1 in [4] gives that $W^n(j, G)$ can be written as

$$W^n(j, G) = \inf_{G' \in \Delta}\Big\{\frac{1}{n}H(G' \mid \tilde F) + \int W^n\big(j+1,\, G + \tfrac{1}{n}\delta_x\big)G'(dx)\Big\}, \tag{5.5}$$

for $j = 0, \dots, n-1$, where

$$\Delta = \{G \in \mathcal{M}_1 : H(G \mid \tilde F) < \infty\}.$$

Moreover, the infimum is attained at $G_{n,j} \in \Delta$ defined by the Radon-Nikodym derivative

$$\frac{dG_{n,j}}{d\tilde F}(x) = \frac{e^{-nW^n(j+1,\, G + \frac{1}{n}\delta_x)}}{\int e^{-nW^n(j+1,\, G + \frac{1}{n}\delta_y)}\tilde F(dy)}. \tag{5.6}$$
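
The step from the recursion formula to (5.5) and (5.6) is an instance of the variational formula $-\log\int e^{-k}\,dF = \inf_G\{H(G \mid F) + \int k\,dG\}$, with the infimum attained at the measure with density proportional to $e^{-k}$. The sketch below checks this identity on a three-point space; the space, the reference measure `f` and the cost `k` are arbitrary illustrative choices, not quantities from the paper.

```python
import math
import random

def relative_entropy(g, f):
    """H(g | f) for probability vectors on a finite set (0 log 0 = 0)."""
    return sum(gi * math.log(gi / fi) for gi, fi in zip(g, f) if gi > 0.0)

def cost(g, f, k):
    """H(g | f) + integral of k with respect to g."""
    return relative_entropy(g, f) + sum(gi * ki for gi, ki in zip(g, k))

f = [0.2, 0.3, 0.5]          # reference measure (illustrative)
k = [1.0, 0.0, 2.5]          # bounded cost function (illustrative)

z = sum(fi * math.exp(-ki) for fi, ki in zip(f, k))
log_form = -math.log(z)                                     # -log int e^{-k} dF
gibbs = [fi * math.exp(-ki) / z for fi, ki in zip(f, k)]    # candidate minimizer

rng = random.Random(0)

def random_probability_vector(size):
    raw = [rng.random() + 1e-12 for _ in range(size)]
    total = sum(raw)
    return [r / total for r in raw]

# cost(G) - log_form should be nonnegative for every probability vector G
gaps = [cost(random_probability_vector(3), f, k) - log_form for _ in range(1000)]
```

In (5.5) the role of `k` is played by $x \mapsto nW^n(j+1, G + \frac{1}{n}\delta_x)$, and dividing the identity by $n$ gives the stated form.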

To derive the representation formula for $W^n$, consider the following related stochastic control problem. For $n \in \mathbb{N}$ and $j \in \{0, 1, \dots, n\}$, let $G_{n,j}$ be a stochastic kernel on $\mathcal{X}$ given $\mathcal{M}_{j/n}$. A controlled process $\{\bar F_{n,j}\}$ is defined by $\bar F_{n,0} = \tilde F_{n,0} = 0$ and

$$\bar F_{n,j} = \frac{1}{n}\sum_{k=0}^{j-1}\delta_{\bar X_{n,k}}, \qquad \bar F_n = \bar F_{n,n},$$

where the conditional distribution of $\bar X_{n,k}$ given $\bar F_{n,0}, \bar F_{n,1}, \dots, \bar F_{n,k}$ is

$$P(\bar X_{n,k} \in dx \mid \bar F_{n,0}, \bar F_{n,1}, \dots, \bar F_{n,k}) = G_{n,k}(dx \mid \bar F_{n,k}).$$

All random variables and the corresponding (controlled) empirical subprobability measures are for all $n$ defined on a common probability space $(\Omega, \mathcal{F}, P)$ which will be used throughout the paper. For these dynamics define the minimal cost functions, for $j \in \{0, 1, \dots, n\}$ and $G \in \mathcal{M}_{j/n}$,

$$V^n(j, G) = \inf_{\{G_{n,k}\}} E\Big[\frac{1}{n}\sum_{k=j}^{n-1} H\big(G_{n,k}(\cdot \mid \bar F_{n,k}) \mid \tilde F\big) + \bar h(\bar F_n) \,\Big|\, \bar F_{n,j} = G\Big]. \tag{5.7}$$

For $j = 0$ and $G = 0$ we set

$$V^n = V^n(0, 0) = \inf_{\{G_{n,k}\}} E\Big[\frac{1}{n}\sum_{k=0}^{n-1} H\big(G_{n,k}(\cdot \mid \bar F_{n,k}) \mid \tilde F\big) + \bar h(\bar F_n)\Big]. \tag{5.8}$$

Proposition 5.1. Let $V^n$ be given by (5.7) and $W^n$ be the solution to (5.2) and (5.3). Then $W^n = V^n$.



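Proof. We begin by proving that $V^n(j, G) \geq W^n(j, G)$ using backwards induction on $j$. Fix a control sequence $\{G_{n,j}\}$ and let $\{\bar F_{n,j}\}$ denote the associated controlled process. Let $\bar\Gamma_n = \Gamma$ and define recursively for $j = 0, 1, \dots, n-1$ the sets $\bar\Gamma_j$ associated with the control sequence $\{G_{n,j}\}$ by

$$\bar\Gamma_j = \Big\{G \in \mathcal{M}_{j/n} : G_{n,j}\big(\{y : G + \tfrac{1}{n}\delta_y \in \bar\Gamma_{j+1}\} \mid G\big) = 1\Big\}.$$

The definition is such that if at time $j$ the controlled process $\bar F_{n,j}$ lies in $\bar\Gamma_j$, then by sampling from $G_{n,j+1}, \dots, G_{n,n-1}$ the controlled process $\bar F_n$ will belong to $\Gamma$ with probability 1.

Consider first the case $j = n$. Clearly,

$$V^n(n, G) = \bar h(G) = W^n(n, G),$$

and the claim is trivial for this $j$.

Take $j = n-1$. Suppose that $G \in \bar\Gamma_{n-1}$, so that $G_{n,n-1}(\{y : G + \frac{1}{n}\delta_y \in \Gamma_n\} \mid G) = 1$. Using (5.3), (5.4) and (5.5), we have

$$\begin{aligned} &E\Big[\frac{1}{n}H\big(G_{n,n-1}(\cdot \mid \bar F_{n,n-1}) \mid \tilde F\big) + \bar h(\bar F_n) \,\Big|\, \bar F_{n,n-1} = G\Big] \\ &\quad= E\Big[\frac{1}{n}H\big(G_{n,n-1}(\cdot \mid \bar F_{n,n-1}) \mid \tilde F\big) + \bar h(\bar F_n) + \int W^n(n, \bar F_n)\,dG_{n,n-1} - \int W^n(n, \bar F_n)\,dG_{n,n-1} \,\Big|\, \bar F_{n,n-1} = G\Big] \\ &\quad\geq E\Big[W^n(n-1, \bar F_{n,n-1}) + \bar h(\bar F_n) - \int W^n(n, \bar F_n)\,dG_{n,n-1} \,\Big|\, \bar F_{n,n-1} = G\Big] \\ &\quad= W^n(n-1, G). \end{aligned}$$

If instead $G \in \bar\Gamma_{n-1}^c$, then $G_{n,n-1}(\{y : G + \frac{1}{n}\delta_y \in \Gamma_n\} \mid G) < 1$ and since $\bar h$ is infinite on $\Gamma^c$, this implies

$$E\Big[\frac{1}{n}H\big(G_{n,n-1}(\cdot \mid \bar F_{n,n-1}) \mid \tilde F\big) + \bar h(\bar F_n) \,\Big|\, \bar F_{n,n-1} = G\Big] = \infty.$$

This shows that

$$E\Big[\frac{1}{n}H\big(G_{n,n-1}(\cdot \mid \bar F_{n,n-1}) \mid \tilde F\big) + \bar h(\bar F_n) \,\Big|\, \bar F_{n,n-1} = G\Big] \geq W^n(n-1, G),$$

for any choice of $G$.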

Proceeding similarly for $j = n-2, n-3, \dots, 0$ then shows that

$$E\Big[\frac{1}{n}\sum_{k=j}^{n-1} H\big(G_{n,k}(\cdot \mid \bar F_{n,k}) \mid \tilde F\big) + \bar h(\bar F_n) \,\Big|\, \bar F_{n,j} = G\Big] \geq W^n(j, G),$$

for all $G$ and $j$. Taking infimum over all admissible control sequences $\{G_{n,j}\} \subset \Delta$ proves the inequality.

Next, the reverse inequality $V^n(j, G) \leq W^n(j, G)$ is proved. For this, consider the control sequence $\{G_{n,j}\}$ defined by (5.6). For this sequence it holds that $\bar\Gamma_j = \Gamma_j$ for all $j$.

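The case $j = n$ was handled above and thus we start by considering $j = n-1$. If $G \in \Gamma_{n-1}$, then by the definition of $G_{n,n-1}$ and (5.5),

$$\begin{aligned} &E\Big[\frac{1}{n}H\big(G_{n,n-1}(\cdot \mid \bar F_{n,n-1}) \mid \tilde F\big) + \bar h(\bar F_n) \,\Big|\, \bar F_{n,n-1} = G\Big] \\ &\quad= E\Big[\frac{1}{n}H\big(G_{n,n-1}(\cdot \mid \bar F_{n,n-1}) \mid \tilde F\big) + W^n(n, \bar F_n) \,\Big|\, \bar F_{n,n-1} = G\Big] \\ &\quad= E\Big[\frac{1}{n}H\big(G_{n,n-1}(\cdot \mid \bar F_{n,n-1}) \mid \tilde F\big) \,\Big|\, \bar F_{n,n-1} = G\Big] + E\big[W^n(n, \bar F_n) \mid \bar F_{n,n-1} = G\big] \\ &\quad= \frac{1}{n}H\big(G_{n,n-1}(\cdot \mid G) \mid \tilde F\big) + \int W^n\big(n, G + \tfrac{1}{n}\delta_y\big)G_{n,n-1}(dy \mid G) \\ &\quad= W^n(n-1, G). \end{aligned}$$

If instead $G \in \Gamma_{n-1}^c$, then $G_{n,n-1}(\{y : G + \frac{1}{n}\delta_y \in \Gamma\} \mid G) < 1$ and since $\bar h$ is infinite on $\Gamma^c$ we have

$$E\Big[\frac{1}{n}H\big(G_{n,n-1}(\cdot \mid \bar F_{n,n-1}) \mid \tilde F\big) + \bar h(\bar F_n) \,\Big|\, \bar F_{n,n-1} = G\Big] = \infty = W^n(n-1, G).$$

This shows that

$$E\Big[\frac{1}{n}H\big(G_{n,n-1}(\cdot \mid \bar F_{n,n-1}) \mid \tilde F\big) + \bar h(\bar F_n) \,\Big|\, \bar F_{n,n-1} = G\Big] = W^n(n-1, G),$$

for all $G$. Proceeding similarly for $j = n-2, \dots, 0$ shows that

$$E\Big[\frac{1}{n}\sum_{k=j}^{n-1} H\big(G_{n,k}(\cdot \mid \bar F_{n,k}) \mid \tilde F\big) + \bar h(\bar F_n) \,\Big|\, \bar F_{n,j} = G\Big] = W^n(j, G),$$

for all $j$ and $G$. Taking infimum over all admissible control sequences $\{G_{n,j}\}$ in $\Delta$ yields the desired inequality. This completes the proof. □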

5.2. Laplace principle lower bound. In this section the Laplace principle lower bound,

$$\liminf_{n}\frac{1}{n}\log E\big[e^{-nh(\tilde F_n^{wf})}\big] \geq -\inf_{G \in \Delta\cap\Gamma}\{h(\Psi(G)) + H(G \mid \tilde F)\}, \tag{5.9}$$

is proved. With $W^n$ as in (5.2), proving this bound is equivalent to proving the upper bound

$$\limsup_{n} W^n \leq \inf_{G \in \Delta\cap\Gamma}\{h(\Psi(G)) + H(G \mid \tilde F)\}. \tag{5.10}$$

To this end the representation formula for $W^n$ derived in Proposition 5.1,

$$W^n(j, G) = \inf_{\{G_{n,k}\}} E\Big[\frac{1}{n}\sum_{k=j}^{n-1} H\big(G_{n,k}(\cdot \mid \bar F_{n,k}) \mid \tilde F\big) + \bar h(\bar F_n) \,\Big|\, \bar F_{n,j} = G\Big],$$

is used. The following strong law of large numbers will play a role in proving the lower bound (5.9).

Proposition 5.2 (Strong law of large numbers under importance sampling). Let $f$ be non-negative, measurable and $F$-integrable. Let $\{X_j\}$ be independent and identically distributed with common distribution $F$. Let $F_n^f$ be the weighted measure in $\mathcal{M}$ determined by

$$F_n^f(g) = \frac{1}{n}\sum_{j=0}^{n-1} f(X_j)g(X_j),$$

for each bounded measurable function $g$. Then, with probability 1,

$$F_n^f \xrightarrow{\tau} F^f$$

in $\mathcal{M}$.

A proof for the corresponding result for empirical measures (i.e., no weights) is found in [U pp. 49-50] and the proof for the case of weighted measures is a direct analogue.
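
The convergence in Proposition 5.2 can be observed in simulation for a single test function $g$. The sketch below takes the sampling distribution to be $N(\mu, 1)$, the target $F$ to be $N(0, 1)$, the weight to be $wf$ with $w = dF/d\tilde F$ and $f = I\{x > 2\}$, and $g \equiv 1$, so that the weighted empirical measure should approach $F(f) = P(Z > 2)$. All of these choices, as well as the seed and sample size, are assumptions made for illustration.

```python
import math
import random

def weighted_empirical_mass(n: int, mu: float, threshold: float, rng: random.Random) -> float:
    """(1/n) * sum_j w(X_j) f(X_j) with X_j ~ N(mu, 1), f = I{x > threshold},
    and w = dN(0,1)/dN(mu,1), i.e. w(x) = exp(-mu * x + mu^2 / 2)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu, 1.0)
        if x > threshold:
            total += math.exp(-mu * x + 0.5 * mu * mu)
    return total / n

rng = random.Random(12345)                            # fixed seed for reproducibility
estimate = weighted_empirical_mass(100_000, 2.0, 2.0, rng)
exact = 0.5 * math.erfc(2.0 / math.sqrt(2.0))         # P(Z > 2) under N(0, 1)
```

With this sample size the estimate is typically within a fraction of a percent of the exact value; the proposition concerns the almost sure limit, not the fluctuations at finite $n$.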

Fix a probability measure $G \in \Delta\cap\Gamma$. Define the control sequence $\{G_{n,j}\}$ by $G_{n,j} = G$ for each $j = 0, 1, \dots, n-1$, so that in every step, the control does not depend on the controlled process. Then all the $\bar X_{n,j}$'s are independent and identically distributed with common distribution $G$ and the associated controlled process $\bar F_n$ belongs to $\Gamma$ with probability 1. Using the representation formula, it follows that

$$W^n \leq E\Big[\frac{1}{n}\sum_{k=0}^{n-1} H(G \mid \tilde F) + \bar h(\bar F_n)\Big] = H(G \mid \tilde F) + E[h(\Psi(\bar F_n))].$$

The function $wf$ is non-negative, measurable and $G$-integrable. By the choice of $G$ and how the controlled process $\bar F_n$ is obtained, Proposition 5.2 implies that with probability 1, $\Psi(\bar F_n; \cdot) \xrightarrow{\tau} \Psi(G; \cdot)$. Moreover, $h$ is bounded and continuous with respect to the $\tau$-topology and thus $h(\Psi(\bar F_n)) \to h(\Psi(G))$ with probability 1. By the dominated convergence theorem,

$$E[h(\Psi(\bar F_n))] \to h(\Psi(G)).$$

From this we conclude that for any $G \in \Delta\cap\Gamma$

$$\limsup_{n} W^n \leq H(G \mid \tilde F) + h(\Psi(G)).$$

Finally, taking infimum over $G$ in $\Delta\cap\Gamma$ on the right hand side proves the upper bound (5.10), and thus the Laplace principle lower bound.

5.3. Laplace principle upper bound. Analogously to the lower bound the Laplace principle upper bound,

$$\limsup_{n}\frac{1}{n}\log E\big[e^{-nh(\tilde F_n^{wf})}\big] \leq -\inf_{G \in \Delta\cap\Gamma}\big[h(\Psi(G)) + H(G \mid \tilde F)\big], \tag{5.11}$$

can be stated as a lower limit for the minimal cost $W^n$,

$$\liminf_{n} W^n \geq \inf_{G \in \Delta\cap\Gamma}\{h(\Psi(G)) + H(G \mid \tilde F)\}. \tag{5.12}$$

It therefore suffices to prove this lower limit for $W^n$ in order to obtain the Laplace principle upper bound. To prove (5.12) it is enough to show that every subsequence has a further subsequence that satisfies the lower limit. Therefore, we henceforth work with a fixed subsequence also denoted by $W^n$.

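Since $H(\cdot \mid \tilde F)$ is a convex function [4, Proposition 1.4.3],

$$\frac{1}{n}\sum_{j=0}^{n-1} H\big(G_{n,j}(\cdot \mid \bar F_{n,j}) \mid \tilde F\big) \geq H\Big(\frac{1}{n}\sum_{j=0}^{n-1} G_{n,j}(\cdot \mid \bar F_{n,j}) \,\Big|\, \tilde F\Big),$$

and together with the representation formula this implies, with $\bar G_n = (1/n)\sum_{j=0}^{n-1} G_{n,j}(\cdot \mid \bar F_{n,j})$,

$$W^n \geq \inf_{\{G_{n,j}\}} E\big[H(\bar G_n \mid \tilde F) + \bar h(\bar F_n)\big].$$

Given $\varepsilon > 0$, there exists a control sequence $\{G_{n,j}\}$ and associated $\bar G_n$ such that

$$W^n + \varepsilon \geq E\big[H(\bar G_n \mid \tilde F) + \bar h(\bar F_n)\big]. \tag{5.13}$$

There is no restriction in assuming $G_{n,j} \in \Delta\cap\Gamma$ for each $j$ and for the remainder of this section this assumption is made.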

To show the Laplace principle upper bound it is now enough to prove the following result.

Proposition 5.3. Every subsequence of $\{(\bar G_n, \bar F_n)\}$ has a further subsequence, also denoted $\{(\bar G_n, \bar F_n)\}$, such that $(\Psi(\bar G_n), \Psi(\bar F_n)) \xrightarrow{\tau} (\Psi(\bar F), \Psi(\bar F))$ with probability 1 along this subsequence, and where $\bar F$ belongs to $\Delta\cap\Gamma$ with probability 1.

Assume for now that Proposition 5.3 holds. The proof of the Laplace principle upper bound then follows from Fatou's lemma and the lower semi-continuity of the relative entropy mapping $G \mapsto H(G \mid \tilde F)$:

$$\begin{aligned} \varepsilon + \liminf_{n} W^n &\geq \liminf_{n} E\big[H(\bar G_n \mid \tilde F) + \bar h(\bar F_n)\big] \\ &\geq E\big[\liminf_{n} H(\bar G_n \mid \tilde F) + \liminf_{n} h(\Psi(\bar F_n))\big] \\ &\geq E\big[H(\bar F \mid \tilde F) + \bar h(\bar F)\big] \\ &\geq \inf_{G \in \Delta\cap\Gamma}\{H(G \mid \tilde F) + h(\Psi(G))\}, \end{aligned}$$

where in the last step we used that $\bar F$ is in $\Delta\cap\Gamma$ with probability 1.

Proposition 5.3 is proved by a series of lemmas. The idea is to first work with $\mathcal{M}_1$ equipped with the weak topology and show that $\{(\bar G_n, \bar F_n)\}$ is tight in this topology. By showing that $E[H(\bar G_n \mid \tilde F)]$ is uniformly bounded (Lemma 5.4) the tightness of $\{(\bar G_n, \bar F_n)\}$ is obtained by showing tightness of each of the marginals. Prohorov's theorem implies relative compactness and thus each subsequence has a sub-subsequence converging to some random element $(\bar G, \bar F)$. Lemma 5.6, which corresponds to [4, Lemma 2.5.1], concludes that $\bar G = \bar F$ w.p. 1, thus establishing that w.p. 1 for each subsequence,

$$(\bar G_n, \bar F_n) \xrightarrow{w} (\bar F, \bar F),$$

along some further subsequence. Next, in Lemma 5.7 it is established that the convergences of the marginals $\bar G_n$ and $\bar F_n$ to $\bar F$ are still valid when $\mathcal{M}_1$ is equipped with the $\tau$-topology. The main ingredient of the proof is an approximation argument introduced in [4, Lemma 9.3.3] and Lemma 5.7 below is a version of that result in the simpler setting where the underlying random variables are independent and identically distributed. Up to this point the proof is as in [4] with minor changes.

Once the convergence in $\mathcal{M}_1$ equipped with the $\tau$-topology is established w.p. 1, it remains to show that it is preserved under the mapping $\Psi$. Lemma 5.8 proves that $\bar F$ is in $\Gamma$ w.p. 1 and thus that $\Psi(\bar F; \cdot)$ is well-defined. The main additional difficulty is handled in Lemma 5.9 where a truncation argument is used to prove that $\Psi(\bar G_n; \cdot)$ converges to $\Psi(\bar F; \cdot)$ and $\Psi(\bar F_n; \cdot)$ converges to $\Psi(\bar F; \cdot)$ in the $\tau$-topology on $\mathcal{M}$.

Lemma 5.4. For a sequence $\{\bar G_n\}$ of control sequences such that (5.13) holds and $G_{n,j} \in \Gamma\cap\Delta$ for each $j$ and $n$, it holds that

$$\sup_{n} E[H(\bar G_n \mid \tilde F)] < \infty.$$

Proof. Since $\bar h$ is bounded on $\Gamma$, it is possible to find a constant $M < \infty$ such that $\sup_{G \in \Gamma}|\bar h(G)| \leq M$. The choice of control sequence $\{G_{n,j}\}$ implies that $\bar F_n \in \Gamma$ with probability 1 for all $n$. Therefore,

$$\begin{aligned} \sup_{n} E[H(\bar G_n \mid \tilde F)] &= \sup_{n} E[H(\bar G_n \mid \tilde F) - M] + M \\ &\leq \sup_{n} E[H(\bar G_n \mid \tilde F) + \bar h(\bar F_n)] + M \\ &\leq \sup_{n}(W^n + \varepsilon) + M \\ &= \sup_{n} W^n + \varepsilon + M. \end{aligned}$$

By the proof of the Laplace principle lower bound we have $\limsup_n W^n < \infty$, and together with $E[H(\bar G_n \mid \tilde F)] < \infty$ for each $n$ this implies that $\sup_n E[H(\bar G_n \mid \tilde F)] < \infty$. □

The uniform boundedness just established is used to prove the tightness of the sequence of admissible control measures.

Lemma 5.5. Under condition (i) of Theorem 3.1 the sequence

$$\Big\{\Big(\frac{1}{n}\sum_{j=0}^{n-1} G_{n,j}(\cdot \mid \bar F_{n,j})\Big) \times \bar F_n\Big\} = \{(\bar G_n \times \bar F_n)\},$$

of admissible control measures in $\mathcal{M}_1 \times \mathcal{M}_1$ is tight in the weak topology.

Lemma 5.5 is a special case of Proposition 8.2.5 in [4] that establishes the result in the more general context of Markov chains. The proof is therefore omitted.

Having established the tightness of the distributions of $\{(\bar G_n, \bar F_n)\}$, the next step is to extend this to almost sure convergence of subsequences, in the weak topology, and to study the limit.

Lemma 5.6. Given any subsequence of $\{(\bar G_n, \bar F_n)\}$, there exists a further subsequence that converges in distribution to some random variable $(\bar G, \bar F)$, where $\bar G = \bar F$ a.s.

The result in Lemma 5.6 is practically identical to part (b) of Lemma 2.5.1 in [4]. The only difference is that we must now appeal to the tightness proved in Lemma 5.5 whereas in [4] the underlying space is assumed to be compact. We omit the proof.

The next step is to show that this weak convergence of subsubsequences actually implies convergence of $\bar G_n$ and $\bar F_n$ in the $\tau$-topology. The result is a version of Lemma 9.3.3 in [4] adapted to the case of independent and identically distributed random variables. Recall that we are already working with a specific subsubsequence (indexed by $n$) and on a probability space where the convergences $\bar G_n \xrightarrow{w} \bar F$ and $\bar F_n \xrightarrow{w} \bar F$ both occur with probability 1. Henceforth, $\mathcal{M}_1$ will be equipped with the $\tau$-topology.

Lemma 5.7. Under the conditions of Lemmas 5.5 and 5.6 there exists some subsequence of $n \in \mathbb{N}$ such that $\bar G_n \xrightarrow{\tau} \bar F$ and $\bar F_n \xrightarrow{\tau} \bar F$ w.p. 1 along this subsequence.

An important ingredient in the proof is the inequality

$$ab \leq e^{\sigma a} + \frac{1}{\sigma}\big(b\log(b) - b + 1\big), \tag{5.14}$$

see the proof of [4, Lemma 9.3.3]. This inequality will also appear in what follows.
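
Inequality (5.14) can be sanity-checked on a grid. The sketch below assumes the usual range $a \geq 0$, $b \geq 0$, $\sigma \geq 1$ with the convention $0\log 0 = 0$; the grid and the function names are illustrative choices.

```python
import math

def entropy_term(b: float) -> float:
    """b log b - b + 1, with the convention 0 log 0 = 0."""
    return b * math.log(b) - b + 1.0 if b > 0.0 else 1.0

def rhs(a: float, b: float, sigma: float) -> float:
    """Right-hand side of (5.14): e^{sigma a} + (1/sigma)(b log b - b + 1)."""
    return math.exp(sigma * a) + entropy_term(b) / sigma

# Slack rhs - ab over a grid of (a, b, sigma); ASSUMED range a, b >= 0 and sigma >= 1.
slacks = [
    rhs(a, b, sigma) - a * b
    for sigma in (1.0, 2.0, 5.0)
    for a in [0.25 * i for i in range(21)]      # a in [0, 5]
    for b in [0.5 * j for j in range(41)]       # b in [0, 20]
]
min_slack = min(slacks)
```

The inequality follows from the convex duality between $x \mapsto e^x - 1$ and $b \mapsto b\log b - b + 1$, applied with $x = \sigma a$, which is why the slack stays positive on the grid.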

With the almost sure convergence in the $\tau$-topology established, the final results needed to prove Proposition 5.3 are obtained in Lemmas 5.8 and 5.9. The first result is that with probability 1 the limit measure $\bar F$ is in the desired region of $\mathcal{M}_1$.

Lemma 5.8. Under the assumption $\int e^{\sigma wf}\,d\tilde F < \infty$ for any $\sigma > 0$, it holds that

$$\sup_{n} E[\bar G_n(wf)] < \infty, \quad \text{and} \quad \sup_{n} E[\bar F_n(wf)] < \infty.$$

It follows that $\bar F \in \Gamma$ w.p. 1.

Proof. Lemma 5.4 shows that $\sup_n E[H(\bar G_n \mid \tilde F)] < \infty$. Hence, each $\bar G_n$ has almost surely a well-defined Radon-Nikodym derivative $w_n$ with respect to $\tilde F$ and by definition

$$H(\bar G_n \mid \tilde F) = \int w_n\log(w_n)\,d\tilde F.$$

Since $w$, $f$ and $w_n$ are all non-negative functions, the inequality (5.14) with $a = wf$, $b = w_n$ and $\sigma = 1$ gives

$$\begin{aligned} \bar G_n(wf) = \int wf\,d\bar G_n &= \int wf\,w_n\,d\tilde F \\ &\leq \int e^{wf}\,d\tilde F + \int\big(w_n\log(w_n) - w_n + 1\big)\,d\tilde F \\ &= \int e^{wf}\,d\tilde F + \int\log(w_n)\,d\bar G_n \\ &= \int e^{wf}\,d\tilde F + H(\bar G_n \mid \tilde F). \end{aligned}$$

Thus,

$$E[\bar G_n(wf)] \leq E[H(\bar G_n \mid \tilde F)] + \int e^{wf}\,d\tilde F$$

and

$$\sup_{n} E[\bar G_n(wf)] \leq \sup_{n} E[H(\bar G_n \mid \tilde F)] + \int e^{wf}\,d\tilde F.$$

By the assumption and Lemma 5.4 it holds that

$$\sup_{n} E[\bar G_n(wf)] < \infty.$$



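Lemma 5.7 proves that $\bar G_n \xrightarrow{\tau} \bar F$ with probability 1. For $m \in \mathbb{N}$, $wf \wedge m$ is a bounded, measurable function and the $\tau$-convergence implies that

$$\infty > \limsup_{n} E[\bar G_n(wf)] \geq \limsup_{n} E[\bar G_n(wf \wedge m)] = E[\bar F(wf \wedge m)].$$

Taking limsup as $m \uparrow \infty$ together with a repeated use of Fatou's lemma gives $E[\bar F(wf)] < \infty$. This is a non-negative random variable and it follows that $\bar F \in \Gamma$ a.s.

That also $\sup_n E[\bar F_n(wf)] < \infty$ is proved by a repeated conditioning on the controlled process.

$$\begin{aligned} E[\bar F_n(wf)] &= E\Big[\frac{1}{n}\sum_{j=0}^{n-1} w(\bar X_{n,j})f(\bar X_{n,j})\Big] \\ &= E\Big[E\Big[\frac{1}{n}w(\bar X_{n,n-1})f(\bar X_{n,n-1}) + \frac{1}{n}\sum_{j=0}^{n-2} w(\bar X_{n,j})f(\bar X_{n,j}) \,\Big|\, \bar F_{n,n-1}\Big]\Big] \\ &= E\Big[\frac{1}{n}\int w(x)f(x)\,G_{n,n-1}(dx \mid \bar F_{n,n-1}) + \bar F_{n,n-1}(wf)\Big] \\ &= E\Big[\frac{1}{n}G_{n,n-1}(wf \mid \bar F_{n,n-1})\Big] + E\Big[E\Big[\frac{1}{n}w(\bar X_{n,n-2})f(\bar X_{n,n-2}) + \frac{1}{n}\sum_{j=0}^{n-3} w(\bar X_{n,j})f(\bar X_{n,j}) \,\Big|\, \bar F_{n,n-2}\Big]\Big] \\ &= E\Big[\frac{1}{n}G_{n,n-1}(wf \mid \bar F_{n,n-1}) + \frac{1}{n}G_{n,n-2}(wf \mid \bar F_{n,n-2}) + \bar F_{n,n-2}(wf)\Big]. \end{aligned}$$

Proceeding like this one obtains

$$\begin{aligned} E[\bar F_n(wf)] &= E\Big[\frac{1}{n}G_{n,0}(wf \mid \bar F_{n,0}) + \cdots + \frac{1}{n}G_{n,n-1}(wf \mid \bar F_{n,n-1})\Big] \\ &= E\Big[\frac{1}{n}\sum_{j=0}^{n-1}\int w(x)f(x)\,G_{n,j}(dx \mid \bar F_{n,j})\Big] \\ &= E[\bar G_n(wf)]. \end{aligned}$$

This completes the proof. □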



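Next it is shown that the almost sure convergence in the $\tau$-topology on $\mathcal{M}_1$ implies almost sure convergence in the $\tau$-topology on $\mathcal{M}$ for the corresponding mapped measures $\Psi(\bar G_n; \cdot)$ and $\Psi(\bar F_n; \cdot)$.

Lemma 5.9. Along the subsequence for which the convergence in the $\tau$-topology holds, the convergences

$$\Psi(\bar G_n; \cdot) \xrightarrow{\tau} \Psi(\bar F; \cdot), \quad \text{and} \quad \Psi(\bar F_n; \cdot) \xrightarrow{\tau} \Psi(\bar F; \cdot),$$

hold with probability 1.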

Proof. Define for $G \in \mathcal{M}_1$ and $m \in \mathbb{N}$ a truncated version $\Psi_m(G; \cdot)$ of the mapping $\Psi$ as the finite measure given by

$$\Psi_m(G; g) = \int (wf \wedge m)g\,dG,$$

for each bounded, measurable function $g$. The function $wf \wedge m$ is bounded and measurable and thus $\Psi_m$ is continuous with respect to the $\tau$-topology on $\mathcal{M}_1$. Therefore,

$$\Psi_m(\bar G_n; g) = \int (wf \wedge m)g\,d\bar G_n \to \int (wf \wedge m)g\,d\bar F = \Psi_m(\bar F; g),$$

as $n \to \infty$ along the particular subsequence. That is, for any bounded, measurable function $g$ it holds w.p. 1 that

$$\lim_{n\to\infty}\Psi_m(\bar G_n; g) = \Psi_m(\bar F; g).$$

Moreover, by Lemma 5.8 the function $wf$ is a.s. integrable with respect to $\bar F$. Since $(wf \wedge m)g \to wfg$ as $m \to \infty$, the dominated convergence theorem implies that with probability 1,

$$\lim_{m\to\infty}\lim_{n\to\infty}\Psi_m(\bar G_n; g) = \Psi(\bar F; g),$$

for every bounded, measurable function $g$. Therefore, the desired convergence in $\mathcal{M}$ with the $\tau$-topology will follow if the order of the limit operators can be interchanged, which holds if

$$\lim_{m\to\infty}\sup_{n}\Big|\int g\,wf\,d\bar G_n - \int g\,(wf \wedge m)\,d\bar G_n\Big| = 0. \tag{5.15}$$

In proving Lemma 5.7 an application of the Skorohod representation theorem results in the sequence $\{H(\bar G_n \mid \tilde F)\}$ being bounded a.s. and (5.15) then follows by an application of the inequality (5.14): the argument goes precisely as in part (c) of Lemma 1.4.3 in [4]. This proves the part concerning the admissible control sequence.

To show convergence of $\Psi(\bar F_n; \cdot)$ an argument similar to that of Lemma 5.8 is used. This line of reasoning is found in Lemmas 8.2.7 and 9.3.3 in [4]. The aim is to show that for each bounded measurable function $g$, $\Psi(\bar F_n; g) - \Psi(\bar G_n; g) \to 0$ in probability in a way such that for indicator functions $g = I_A$ we can appeal to the first Borel-Cantelli lemma to get almost sure convergence of the entire measure along some subsequence. For this, take any $\varepsilon > 0$ and consider

$$\begin{aligned} P\Big(\Big|\int wfg\,d\bar F_n - \int wfg\,d\bar G_n\Big| > 3\varepsilon\Big) &\leq P\Big(\Big|\int (wf \wedge m)g\,d\bar F_n - \int wfg\,d\bar F_n\Big| > \varepsilon\Big) \\ &\quad + P\Big(\Big|\int (wf \wedge m)g\,d\bar F_n - \int (wf \wedge m)g\,d\bar G_n\Big| > \varepsilon\Big) \\ &\quad + P\Big(\Big|\int (wf \wedge m)g\,d\bar G_n - \int wfg\,d\bar G_n\Big| > \varepsilon\Big), \end{aligned}$$

which holds for any $m > 0$. For $n \in \mathbb{N}$ and $j \in \{0, 1, \dots, n-1\}$, define the $\sigma$-algebra generated by the controlled process up to time $j$,

$$\mathcal{F}_{n,j} = \sigma(\bar F_{n,0}, \bar F_{n,1}, \dots, \bar F_{n,j}).$$

Recall that

$$P(\bar X_{n,j} \in dy \mid \bar F_{n,0}, \bar F_{n,1}, \dots, \bar F_{n,j}) = G_{n,j}(dy \mid \bar F_{n,j}),$$

that is, $G_{n,j}(\cdot \mid \bar F_{n,j})$ is a regular conditional distribution for $\bar X_{n,j}$ given $\mathcal{F}_{n,j}$. Now condition on the $\mathcal{F}_{n,j}$'s to relate expectation of integrals with respect to $\bar F_{n,j}$ to integrals with respect to the control measures $G_{n,j}$.

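$$\begin{aligned} P\Big(\Big|\int (wf \wedge m)g\,d\bar F_n - \int wfg\,d\bar F_n\Big| > \varepsilon\Big) &\leq \frac{1}{\varepsilon}E\Big[\Big|\int (wf \wedge m)g\,d\bar F_n - \int wfg\,d\bar F_n\Big|\Big] \\ &\leq \frac{\|g\|_\infty}{\varepsilon}E\Big[\int (wf - wf \wedge m)\,d\bar F_n\Big] \\ &= \frac{\|g\|_\infty}{\varepsilon}E\Big[\frac{1}{n}\sum_{j=0}^{n-1}(wf - wf \wedge m)(\bar X_{n,j})\Big] \\ &= \frac{\|g\|_\infty}{\varepsilon}E\Big[\frac{1}{n}\sum_{j=0}^{n-1}E\big[(wf - wf \wedge m)(\bar X_{n,j}) \mid \mathcal{F}_{n,j}\big]\Big] \\ &= \frac{\|g\|_\infty}{\varepsilon}E\Big[\frac{1}{n}\sum_{j=0}^{n-1}\int (wf - wf \wedge m)\,dG_{n,j}\Big] \\ &= \frac{\|g\|_\infty}{\varepsilon}E\Big[\int (wf - wf \wedge m)\,d\bar G_n\Big]. \end{aligned}$$

Analogously, it holds that

$$P\Big(\Big|\int (wf \wedge m)g\,d\bar G_n - \int wfg\,d\bar G_n\Big| > \varepsilon\Big) \leq \frac{\|g\|_\infty}{\varepsilon}E\Big[\int (wf - wf \wedge m)\,d\bar G_n\Big].$$

Since $wf - wf \wedge m$ is non-negative and $\bar G_n$ w.p. 1 has a Radon-Nikodym derivative $w_n$ with respect to $\tilde F$, we can apply the inequality (5.14). For any $\sigma \geq 1$,

$$E\Big[\int (wf - wf \wedge m)\,d\bar G_n\Big] \leq \int e^{\sigma(wf - wf \wedge m)}\,d\tilde F + \frac{1}{\sigma}E[H(\bar G_n \mid \tilde F)].$$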
Finally, consider 

P( \j(wfA m)gdF n - J (wf A m)gd~G n \ > e) , 

for which we use a similar argument as in the proof of |4] Lemma 8.2.7]. For any 
bounded measurable function g : X — > 1Z, 



E[(wf A m)g(X n ,j) - J (wf A m)gdG n , 3 (x \ F n j) \ F n ,j] 

= E[(wf A m)g(X ni j) \ T n>j ] - G n<j ((wf A m)g \ F nJ ) 
= G n j((wf A m)g \ F n>j ) - G n>j ((wf A m)g \ F n ,j) = 

-a.s. since G n j(- \ F n j) is a regular conditional distribution of X n j. Hence, 

{(wf A m)g(X nd ) - G n ,j((wf A m)g \ F nj )} 3=Q1 n _ v 






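is a martingale difference sequence with respect to $\mathcal{F}_{n,j}$. Moreover, for any $\varepsilon > 0$,

$$\begin{aligned} &P\Big(\Big|\int (wf \wedge m)g\,d\bar F_n - \int (wf \wedge m)g\,d\bar G_n\Big| > \varepsilon\Big) \\ &\quad\leq \frac{1}{\varepsilon^2}E\Big[\Big(\int (wf \wedge m)g\,d\bar F_n - \int (wf \wedge m)g\,d\bar G_n\Big)^2\Big] \\ &\quad= \frac{1}{\varepsilon^2 n^2}E\Big[\Big(\sum_{j=0}^{n-1}\big((wf \wedge m)g(\bar X_{n,j}) - G_{n,j}((wf \wedge m)g \mid \bar F_{n,j})\big)\Big)^2\Big] \\ &\quad= \frac{1}{\varepsilon^2 n^2}E\Big[\sum_{j=0}^{n-1}\big((wf \wedge m)g(\bar X_{n,j}) - G_{n,j}((wf \wedge m)g \mid \bar F_{n,j})\big)^2 \\ &\qquad\qquad + \sum_{i \neq j}\big((wf \wedge m)g(\bar X_{n,i}) - G_{n,i}((wf \wedge m)g \mid \bar F_{n,i})\big)\big((wf \wedge m)g(\bar X_{n,j}) - G_{n,j}((wf \wedge m)g \mid \bar F_{n,j})\big)\Big]. \end{aligned}$$

The second term vanishes when conditioning on the $\sigma$-algebras $\{\mathcal{F}_{n,j}\}$. For the first term inside the expectation an upper bound on each of the summands is

$$\big((wf \wedge m)g(\bar X_{n,j}) - G_{n,j}((wf \wedge m)g \mid \bar F_{n,j})\big)^2 \leq 4\|g\|_\infty^2 m^2.$$

It follows that

$$\frac{1}{\varepsilon^2 n^2}E\Big[\Big(\sum_{j=0}^{n-1}\big((wf \wedge m)g(\bar X_{n,j}) - G_{n,j}((wf \wedge m)g \mid \bar F_{n,j})\big)\Big)^2\Big] \leq \frac{4\|g\|_\infty^2 m^2}{\varepsilon^2 n}.$$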

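This all adds up to

$$P\big(|\Psi(\bar F_n; g) - \Psi(\bar G_n; g)| > 3\varepsilon\big) \leq \frac{2\|g\|_\infty}{\varepsilon}\Big(\int e^{\sigma(wf - wf \wedge m)}\,d\tilde F + \frac{1}{\sigma}E[H(\bar G_n \mid \tilde F)]\Big) + \frac{4\|g\|_\infty^2 m^2}{\varepsilon^2 n},$$

which goes to 0 if we send $n$, $m$ and $\sigma$ to $+\infty$ in that order. This holds due to the assumption $\int e^{\sigma wf}\,d\tilde F < \infty$, and Lemma 5.4.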

To complete the proof we want to use the same argument as in Lemma 9.3.3 in
[4] together with the convergence for $\Psi(\bar{G}_n; \cdot)$. For this it is not enough to have the
above convergence in probability; one needs to find a subsequence along
which the probabilities converge fast enough for an application of the first
Borel-Cantelli lemma. Say that we want a subsequence $\{n_k\}$ such that for $n > n_k$,
\[
P\big(|\Psi(F_n; wfg) - \Psi(\bar{G}_n; wfg)| > 3\varepsilon\big) < 2^{-k}.
\]
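Why a summable bound such as $2^{-k}$ suffices can be sketched as follows. The label $A_k$ is introduced only for this illustration and denotes the event in the display above, evaluated along the chosen subsequence.

```latex
% A_k: the event {|\Psi(F_n; wfg) - \Psi(\bar{G}_n; wfg)| > 3\varepsilon}
%      evaluated along the subsequence (hypothetical label, for illustration).
\begin{align*}
\sum_{k=1}^{\infty} P(A_k) &\le \sum_{k=1}^{\infty} 2^{-k} = 1 < \infty, \\
P\big(A_k \text{ infinitely often}\big) &= 0 \qquad \text{(first Borel-Cantelli lemma)}.
\end{align*}
```

Hence, with probability one, the difference is at most $3\varepsilon$ for all sufficiently large $k$, which is the almost sure statement that convergence in probability alone does not provide.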

Start by picking a $\sigma_k$ such that
\[
\frac{1}{\sigma_k} \sup_n E[H(\bar{G}_n \mid F)] < \frac{\cdots}{6},
\]
which is again possible due to Lemma 5.4. Having chosen this $\sigma_k$, pick $m_k$ sufficiently
large so that

\[
\int e^{\sigma_k (wf - wf \wedge m_k)}\, dF < \frac{\cdots}{3}.
\]

This is possible due to the assumption $F(e^{\alpha wf}) < \infty$ for every $\alpha > 0$. Finally, pick
$n_k$ such that
\[
\frac{4 \|g\|_\infty^2 m_k^2}{n_k \varepsilon^2} \le \frac{\cdots}{12}.
\]

If $g$ is an indicator function, the above yields a sequence $\{n_k\}$ such that the
probability of interest is smaller than $2^{-k}$. The key to the argument used in Lemma
9.3.3 in [4] is that, since the underlying space $\mathcal{X}$ is assumed complete and separable,
the Borel $\sigma$-algebra on $\mathcal{X}$ is generated by a countable collection of Borel sets.
Together with the above this yields the desired result, namely that
\[
\Psi(F_n; \cdot) \to \Psi(F; \cdot),
\]
in $\mathcal{M}$ w.p. 1. We refer the reader to [4] for the details. □
Lemmas 5.4-5.9 complete the proof of Proposition 5.3.

We end this section by proving that the function $I$ in (3.4) has sequentially
compact level sets in the $\tau$-topology.

Proposition 5.10. The function $I : \mathcal{M} \to [0, \infty]$ in (3.4) has sequentially compact
level sets on $\mathcal{M}$ equipped with the $\tau$-topology.

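Proof. Let
\[
C(K) = \{\nu \in \mathcal{M} : I(\nu) \le K\},
\]
for $K < \infty$. Take any sequence $\{\nu_n\} \subset C(K)$. Since $I(\nu_j) \le K$ for each $j$, there
exists a sequence $\{G_n\} \subset \mathcal{M}_1$ such that $\Psi(G_j) = \nu_j$. Moreover, for any fixed $\varepsilon > 0$
the $G_j$ can be chosen so that $H(G_j \mid F) \le K + \varepsilon$ for each $j$. Hence,
\[
\sup_j H(G_j \mid F) < \infty.
\]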

The relative entropy has compact level sets in the $\tau$-topology [4, Proposition 9.3.6].
Therefore, there exists some subsequence, also indexed by $n$, and some $G^*$ such
that
\[
G_n \xrightarrow{\tau} G^*,
\]
and the corresponding (finite) measures $\nu_n = \Psi(G_n)$, $\nu^* = \Psi(G^*)$ are in the set
$C(K)$. It remains to prove that $\nu_n \xrightarrow{\tau} \nu^*$. To this end, we note that by the same
arguments as in Lemma 5.8 it holds that

\[
\sup_n G_n(wf) < \infty,
\]
and
\[
G^*(wf) < \infty.
\]

The conditions used to prove Lemma [5^1 are therefore satisfied and it follows in the 
same way that for every bounded, measurable g, 

\[
\nu_n(g) = \Psi(G_n; g) \to \Psi(G^*; g) = \nu^*(g)
\]
as $n \to \infty$. Hence, $\nu_n \xrightarrow{\tau} \nu^*$ and the level set $C(K)$ is indeed sequentially compact in the
$\tau$-topology. □